<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Justin Domke's Weblog</title>
	<atom:link href="http://justindomke.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://justindomke.wordpress.com</link>
	<description></description>
	<lastBuildDate>Thu, 05 Jan 2012 14:22:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='justindomke.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Justin Domke's Weblog</title>
		<link>http://justindomke.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://justindomke.wordpress.com/osd.xml" title="Justin Domke&#039;s Weblog" />
	<atom:link rel='hub' href='http://justindomke.wordpress.com/?pushpress=hub'/>
		<item>
		<title>CRF Toolbox Updated</title>
		<link>http://justindomke.wordpress.com/2012/01/04/crf-toolbox-updated/</link>
		<comments>http://justindomke.wordpress.com/2012/01/04/crf-toolbox-updated/#comments</comments>
		<pubDate>Wed, 04 Jan 2012 20:21:28 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[computer vision]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[matlab]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=1140</guid>
		<description><![CDATA[I updated the code for my Graphical Models / Conditional Random Fields toolbox This is a Matlab toolbox, though almost all the real work is done in compiled C++ for efficiency. The main improvements are: Lots of bugfixes. Various small improvements in speed. A unified CRF training interface to make things easier for those not [...]]]></description>
			<content:encoded><![CDATA[<p>I updated the code for my <a href="http://phd.gccis.rit.edu/justindomke/JGMT/">Graphical Models / Conditional Random Fields toolbox</a>.  This is a Matlab toolbox, though almost all the real work is done in compiled C++ for efficiency. The main improvements are:</p>
<ul>
<li>Lots of bugfixes.</li>
<li>Various small improvements in speed.</li>
<li>A unified CRF training interface to make things easier for those not training on images.</li>
<li>Binaries are now provided for Linux as well as OS X.</li>
<li>The code for inference and learning using TRW is now multithreaded, using <a href="http://www.openmp.org/">openmp</a>.</li>
<li>Switched to using a newer version of Eigen.</li>
</ul>
<p>There are also far more detailed examples, including a full tutorial on how to train a CRF to do &#8220;semantic segmentation&#8221; on the <a href="http://dags.stanford.edu/projects/scenedataset.html">Stanford Backgrounds dataset</a>. Using only simple color, position, and Histogram of Gradient features, the error rate is 23%, which appears to be state of the art (and better than previous CRF-based approaches).  It takes about 90 minutes to train on my 8-core machine, and processes new frames in a little over a second each.</p>
<p><img src="http://phd.gccis.rit.edu/justindomke/JGMT/marginals_backgrounds_small.png" alt="" /></p>
<p>For fun, I also ran this model on a video of someone driving from Alexandria into Georgetown. You can see that the results are far from perfect but are reasonably good.  (Notice that it successfully distinguishes trees and grass at 0:12.)</p>
<span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='380' height='244' src='http://www.youtube.com/embed/j-5P_yjPhso?version=3&amp;rel=1&amp;fs=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent' frameborder='0'></iframe></span>
<p>I&#8217;m keen to have others use the code (what with the hundreds of hours spent writing it), so please send email if you have any issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2012/01/04/crf-toolbox-updated/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>

		<media:content url="http://phd.gccis.rit.edu/justindomke/JGMT/marginals_backgrounds_small.png" medium="image" />
	</item>
		<item>
		<title>Personal opinions about graphical models 1: The surrogate likelihood exists and you should use it.</title>
		<link>http://justindomke.wordpress.com/2011/11/01/personal-opinions-about-graphical-models-1-the-surrogate-likelihood-exists-and-you-should-use-it/</link>
		<comments>http://justindomke.wordpress.com/2011/11/01/personal-opinions-about-graphical-models-1-the-surrogate-likelihood-exists-and-you-should-use-it/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 19:29:09 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[graphical models]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[probabilistic inference]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=1071</guid>
		<description><![CDATA[When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I&#8217;d write a few entries here. The first thing I&#8217;d like to discuss is &#8220;surrogate likelihood&#8221; training.  (So far as I [...]]]></description>
			<content:encoded><![CDATA[<p>When talking about graphical models with people  (particularly computer vision folks) I find myself advancing a few opinions over and over again.  So, in an effort to stop bothering people at conferences, I thought I&#8217;d write a few entries here.</p>
<p>The first thing I&#8217;d like to discuss is &#8220;surrogate likelihood&#8221; training.  (So far as I know, Martin Wainwright was the first person <a href="http://jmlr.csail.mit.edu/papers/volume7/wainwright06a/wainwright06a.pdf">to give a name</a> to this method.)</p>
<h3>Background</h3>
<p>Suppose we want to fit a Markov random field (MRF).  I&#8217;m writing this as a generative model with an MRF for simplicity&#8211; pretty much the same story holds with a Conditional Random Field in the discriminative setting.</p>
<p><img src='http://s0.wp.com/latex.php?latex=p%28%7B%5Cbf+x%7D%29+%3D+%5Cfrac%7B1%7D%7BZ%7D+%5Cprod_%7Bc%7D+%5Cpsi%28%7B%5Cbf+x%7D_c%29+%5Cprod_i+%5Cpsi%28x_i%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='p({&#92;bf x}) = &#92;frac{1}{Z} &#92;prod_{c} &#92;psi({&#92;bf x}_c) &#92;prod_i &#92;psi(x_i)' title='p({&#92;bf x}) = &#92;frac{1}{Z} &#92;prod_{c} &#92;psi({&#92;bf x}_c) &#92;prod_i &#92;psi(x_i)' class='latex' /></p>
<p>Here, the first product is over all cliques/factors in the graph, and the second is over all single variables.  Now, it is convenient to note that MRFs can be seen as members of the exponential family</p>
<p><img src='http://s0.wp.com/latex.php?latex=p%28%7B%5Cbf+x%7D%3B%7B%5Cboldsymbol+%5Ctheta%7D%29+%3D+%5Cexp%28+%7B%5Cboldsymbol+%5Ctheta%7D+%5Ccdot+%7B%5Cbf+f%7D%28%7B%5Cbf+x%7D%29+-+A%28%7B%5Cboldsymbol+%5Ctheta%7D%29+%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='p({&#92;bf x};{&#92;boldsymbol &#92;theta}) = &#92;exp( {&#92;boldsymbol &#92;theta} &#92;cdot {&#92;bf f}({&#92;bf x}) - A({&#92;boldsymbol &#92;theta}) )' title='p({&#92;bf x};{&#92;boldsymbol &#92;theta}) = &#92;exp( {&#92;boldsymbol &#92;theta} &#92;cdot {&#92;bf f}({&#92;bf x}) - A({&#92;boldsymbol &#92;theta}) )' class='latex' />,</p>
<p>where</p>
<p><img src='http://s0.wp.com/latex.php?latex=%7B%5Cbf+f%7D%28%7B%5Cbf+X%7D%29%3D%5C%7BI%5B%7B%5Cbf+X%7D_%7Bc%7D%3D%7B%5Cbf+x%7D_%7Bc%7D%5D%7C%5Cforall+c%2C%7B%5Cbf+x%7D_%7Bc%7D%5C%7D%5Ccup%5C%7BI%5BX_%7Bi%7D%3Dx_%7Bi%7D%5D%7C%5Cforall+i%2Cx_%7Bi%7D+%5C%7D+&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;bf f}({&#92;bf X})=&#92;{I[{&#92;bf X}_{c}={&#92;bf x}_{c}]|&#92;forall c,{&#92;bf x}_{c}&#92;}&#92;cup&#92;{I[X_{i}=x_{i}]|&#92;forall i,x_{i} &#92;} ' title='{&#92;bf f}({&#92;bf X})=&#92;{I[{&#92;bf X}_{c}={&#92;bf x}_{c}]|&#92;forall c,{&#92;bf x}_{c}&#92;}&#92;cup&#92;{I[X_{i}=x_{i}]|&#92;forall i,x_{i} &#92;} ' class='latex' /></p>
<p>is a function consisting of indicator functions for each possible configuration of each clique and variable, and the log-partition function</p>
<p><img src='http://s0.wp.com/latex.php?latex=A%28%5Cboldsymbol%7B%5Ctheta%7D%29%3D%5Clog%5Csum_%7B%7B%5Cbf+x%7D%7D%5Cexp%5Cboldsymbol%7B%5Ctheta%7D%5Ccdot%7B%5Cbf+f%7D%28%7B%5Cbf+x%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='A(&#92;boldsymbol{&#92;theta})=&#92;log&#92;sum_{{&#92;bf x}}&#92;exp&#92;boldsymbol{&#92;theta}&#92;cdot{&#92;bf f}({&#92;bf x})' title='A(&#92;boldsymbol{&#92;theta})=&#92;log&#92;sum_{{&#92;bf x}}&#92;exp&#92;boldsymbol{&#92;theta}&#92;cdot{&#92;bf f}({&#92;bf x})' class='latex' /></p>
<p>ensures normalization.</p>
<p>Now, the log-partition function has the very important (and easy to show) property that the gradient is the expected value of <img src='http://s0.wp.com/latex.php?latex=%5Cbf+f&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf f' title='&#92;bf f' class='latex' />.</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cfrac%7BdA%7D%7Bd%7B%5Cboldsymbol+%5Ctheta%7D%7D+%3D+%5Csum_%7B%5Cbf+x%7D+p%28%7B%5Cbf+x%7D%3B%7B%5Cboldsymbol+%5Ctheta%7D%29+%7B%5Cbf+f%7D%28%7B%5Cbf+x%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle &#92;frac{dA}{d{&#92;boldsymbol &#92;theta}} = &#92;sum_{&#92;bf x} p({&#92;bf x};{&#92;boldsymbol &#92;theta}) {&#92;bf f}({&#92;bf x})' title='&#92;displaystyle &#92;frac{dA}{d{&#92;boldsymbol &#92;theta}} = &#92;sum_{&#92;bf x} p({&#92;bf x};{&#92;boldsymbol &#92;theta}) {&#92;bf f}({&#92;bf x})' class='latex' /></p>
<p>With a graphical model, what does this mean?  Well, notice that the expected value of, say, <img src='http://s0.wp.com/latex.php?latex=I%5BX_i%3Dx_i%5D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='I[X_i=x_i]' title='I[X_i=x_i]' class='latex' /> will be exactly <img src='http://s0.wp.com/latex.php?latex=p%28x_i%3B%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='p(x_i;{&#92;boldsymbol &#92;theta})' title='p(x_i;{&#92;boldsymbol &#92;theta})' class='latex' />. Thus, the expected value of <img src='http://s0.wp.com/latex.php?latex=%7B%5Cbf+f%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;bf f}' title='{&#92;bf f}' class='latex' /> will be a vector containing all univariate and clique-wise marginals.  If we write this as <img src='http://s0.wp.com/latex.php?latex=%7B%5Cboldsymbol+%5Cmu%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' title='{&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' class='latex' />, then we have</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cfrac%7BdA%7D%7Bd%7B%5Cboldsymbol+%5Ctheta%7D%7D+%3D+%7B%5Cboldsymbol+%5Cmu%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle &#92;frac{dA}{d{&#92;boldsymbol &#92;theta}} = {&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' title='&#92;displaystyle &#92;frac{dA}{d{&#92;boldsymbol &#92;theta}} = {&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' class='latex' />.</p>
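<p>The identity above is easy to check numerically on a model small enough to enumerate. The following is only a sketch, not part of any toolbox: it assumes two binary variables and the minimal features <code>[x1, x2, x1*x2]</code> instead of the full set of indicator functions, and every name in it is made up for illustration. It compares a finite-difference gradient of the log-partition function against the marginals.</p>

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))  # all (x1, x2) configurations

def features(x):
    """Toy sufficient statistics f(x) = [x1, x2, x1*x2] (an assumption)."""
    x1, x2 = x
    return [x1, x2, x1 * x2]

def score(theta, x):
    """Inner product theta . f(x)."""
    return sum(t * f for t, f in zip(theta, features(x)))

def log_partition(theta):
    """A(theta) = log sum_x exp(theta . f(x)), by brute-force enumeration."""
    return math.log(sum(math.exp(score(theta, x)) for x in STATES))

def marginals(theta):
    """mu(theta) = sum_x p(x; theta) f(x)."""
    A = log_partition(theta)
    mu = [0.0, 0.0, 0.0]
    for x in STATES:
        p = math.exp(score(theta, x) - A)
        mu = [m + p * f for m, f in zip(mu, features(x))]
    return mu

def numeric_gradient(fun, theta, h=1e-5):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grad.append((fun(up) - fun(down)) / (2 * h))
    return grad

theta = [0.2, -0.3, 0.5]  # arbitrary parameters
mu = marginals(theta)
grad_A = numeric_gradient(log_partition, theta)
print(mu)
print(grad_A)  # should agree with mu up to finite-difference error
```

<p>Enumeration is of course exactly what becomes impossible in a high-treewidth model, which is where approximate inference comes in.</p>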
<h3>The usual story</h3>
<p>Suppose we want to do maximum likelihood learning.  This means we want to set <img src='http://s0.wp.com/latex.php?latex=%7B%5Cboldsymbol+%5Ctheta%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;boldsymbol &#92;theta}' title='{&#92;boldsymbol &#92;theta}' class='latex' /> to maximize</p>
<p><img src='http://s0.wp.com/latex.php?latex=L%28+%7B%5Cboldsymbol+%5Ctheta%7D+%29+%3D+%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%5Clog+p%28%5Chat%7B%7B%5Cbf+x%7D%7D%3B%7B%5Cboldsymbol+%5Ctheta%7D%29%3D%7B%5Cboldsymbol+%5Ctheta%7D%5Ccdot%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29-A%28%7B%5Cboldsymbol+%5Ctheta%7D%29.&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='L( {&#92;boldsymbol &#92;theta} ) = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}&#92;log p(&#92;hat{{&#92;bf x}};{&#92;boldsymbol &#92;theta})={&#92;boldsymbol &#92;theta}&#92;cdot&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}})-A({&#92;boldsymbol &#92;theta}).' title='L( {&#92;boldsymbol &#92;theta} ) = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}&#92;log p(&#92;hat{{&#92;bf x}};{&#92;boldsymbol &#92;theta})={&#92;boldsymbol &#92;theta}&#92;cdot&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}})-A({&#92;boldsymbol &#92;theta}).' class='latex' /></p>
<p>If we want to use gradient ascent, we would just take a small step along the gradient.  This has a very intuitive form: it is the difference between the expected value of <img src='http://s0.wp.com/latex.php?latex=%5Cbf+f&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf f' title='&#92;bf f' class='latex' /> under the data and the expected value of <img src='http://s0.wp.com/latex.php?latex=%5Cbf+f&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf f' title='&#92;bf f' class='latex' /> under the current distribution.</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cfrac%7BdL%7D%7Bd%7B%5Cboldsymbol+%5Ctheta%7D%7D+%3D+%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29+-+%5Csum_%7B%5Cbf+x%7D+p%28%7B%5Cbf+x%7D%3B%7B%5Cboldsymbol+%5Ctheta%7D%29+%7B%5Cbf+f%7D%28%7B%5Cbf+x%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle &#92;frac{dL}{d{&#92;boldsymbol &#92;theta}} = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;sum_{&#92;bf x} p({&#92;bf x};{&#92;boldsymbol &#92;theta}) {&#92;bf f}({&#92;bf x})' title='&#92;displaystyle &#92;frac{dL}{d{&#92;boldsymbol &#92;theta}} = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;sum_{&#92;bf x} p({&#92;bf x};{&#92;boldsymbol &#92;theta}) {&#92;bf f}({&#92;bf x})' class='latex' />.</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cfrac%7BdL%7D%7Bd%7B%5Cboldsymbol+%5Ctheta%7D%7D+%3D+%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29+-+%7B%5Cboldsymbol+%5Cmu%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle &#92;frac{dL}{d{&#92;boldsymbol &#92;theta}} = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - {&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' title='&#92;displaystyle &#92;frac{dL}{d{&#92;boldsymbol &#92;theta}} = &#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - {&#92;boldsymbol &#92;mu}({&#92;boldsymbol &#92;theta})' class='latex' />.</p>
<p>Note the lovely property of moment matching here. If we have found a solution, then <img src='http://s0.wp.com/latex.php?latex=dL%2Fd%7B%5Cboldsymbol+%5Ctheta%7D%3D0&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='dL/d{&#92;boldsymbol &#92;theta}=0' title='dL/d{&#92;boldsymbol &#92;theta}=0' class='latex' /> and so the expected value of <img src='http://s0.wp.com/latex.php?latex=%5Cbf+f&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf f' title='&#92;bf f' class='latex' /> under the current distribution will be exactly equal to that under the data.</p>
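<p>To see moment matching happen, here is a minimal sketch of exact maximum likelihood by gradient ascent on a model small enough to enumerate. The two binary variables, the features <code>[x1, x2, x1*x2]</code>, the six-point dataset, and the step size are all assumptions chosen for illustration.</p>

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))

def features(x):
    """Toy sufficient statistics f(x) = [x1, x2, x1*x2] (an assumption)."""
    x1, x2 = x
    return [x1, x2, x1 * x2]

def marginals(theta):
    """Exact mu(theta) by enumeration (only feasible for tiny models)."""
    weights = [math.exp(sum(t * f for t, f in zip(theta, features(x))))
               for x in STATES]
    Z = sum(weights)
    mu = [0.0, 0.0, 0.0]
    for x, w in zip(STATES, weights):
        mu = [m + (w / Z) * f for m, f in zip(mu, features(x))]
    return mu

# Hypothetical training data: six observations of (x1, x2).
data = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 1)]
data_moments = [sum(features(x)[k] for x in data) / len(data) for k in range(3)]

theta = [0.0, 0.0, 0.0]
step = 0.5
for _ in range(5000):
    mu = marginals(theta)
    # dL/dtheta = (mean of f over the data) - mu(theta)
    grad = [d - m for d, m in zip(data_moments, mu)]
    theta = [t + step * g for t, g in zip(theta, grad)]

print(data_moments)
print(marginals(theta))  # agrees with data_moments once the gradient vanishes
```

<p>Plain gradient ascent with a fixed step is used only to keep the sketch short.</p>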
<p>Unfortunately, in a high-treewidth setting, we can&#8217;t compute the marginals.  That&#8217;s too bad.  However, we have all these lovely approximate inference algorithms (loopy belief propagation, tree-reweighted belief propagation, mean field, etc.).  Suppose we write the resulting approximate marginals as <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7B%7B%5Cboldsymbol+%5Cmu%7D%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' title='&#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' class='latex' />.  Then, instead of taking the above gradient step, why not instead just use</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29+-+%5Ctilde%7B%7B%5Cboldsymbol+%5Cmu%7D%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' title='&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' class='latex' />?</p>
<p>That&#8217;s all fine!  However, I often see people say/imply/write some or all of the following:</p>
<ol>
<li>This is not guaranteed to converge.</li>
<li>There is no longer any well-defined objective function being maximized.</li>
<li>We can&#8217;t use line searches.</li>
<li>We have to use (possibly stochastic) gradient ascent.</li>
<li>This whole procedure is frightening and shouldn&#8217;t be mentioned in polite company.</li>
</ol>
<p>I agree that we should view this procedure with some suspicion, but it gets far more than it deserves! The first four points, in my view, are simply wrong.</p>
<h3>What&#8217;s missing</h3>
<p>The critical thing that is missing from the above story is this: <em>Approximate marginals come together with an approximate partition function</em>!</p>
<p>That is, if you are computing approximate marginals <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7B%7B%5Cboldsymbol+%5Cmu%7D%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' title='&#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta})' class='latex' /> using loopy belief propagation, mean-field, or tree-reweighted belief propagation, there is a well-defined approximate log-partition function <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7BA%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{A}({&#92;boldsymbol &#92;theta})' title='&#92;tilde{A}({&#92;boldsymbol &#92;theta})' class='latex' /> such that</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Ctilde%7B%7B%5Cboldsymbol+%5Cmu%7D%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29+%3D+%5Cfrac%7Bd%5Ctilde%7BA%7D%7D%7Bd%7B%5Cboldsymbol+%5Ctheta%7D%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta}) = &#92;frac{d&#92;tilde{A}}{d{&#92;boldsymbol &#92;theta}}' title='&#92;displaystyle &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta}) = &#92;frac{d&#92;tilde{A}}{d{&#92;boldsymbol &#92;theta}}' class='latex' />.</p>
<p>What this means is that you should think, not of approximating the likelihood <em>gradient</em>, but of approximating the likelihood <em>itself</em>. Specifically, what the above is really doing is optimizing the &#8220;surrogate likelihood&#8221;</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7BL%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29+%3D+%7B%5Cboldsymbol+%5Ctheta%7D%5Ccdot%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29-%5Ctilde%7BA%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29.&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{L}({&#92;boldsymbol &#92;theta}) = {&#92;boldsymbol &#92;theta}&#92;cdot&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}})-&#92;tilde{A}({&#92;boldsymbol &#92;theta}).' title='&#92;tilde{L}({&#92;boldsymbol &#92;theta}) = {&#92;boldsymbol &#92;theta}&#92;cdot&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}})-&#92;tilde{A}({&#92;boldsymbol &#92;theta}).' class='latex' /></p>
<p>What&#8217;s the gradient of this? It is</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7BN%7D%5Csum_%7B%5Chat%7B%7B%5Cbf+x%7D%7D%7D%7B%5Cbf+f%7D%28%5Chat%7B%7B%5Cbf+x%7D%7D%29+-+%5Ctilde%7B%7B%5Cboldsymbol+%5Cmu%7D%7D%28%7B%5Cboldsymbol+%5Ctheta%7D%29%2C&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta}),' title='&#92;frac{1}{N}&#92;sum_{&#92;hat{{&#92;bf x}}}{&#92;bf f}(&#92;hat{{&#92;bf x}}) - &#92;tilde{{&#92;boldsymbol &#92;mu}}({&#92;boldsymbol &#92;theta}),' class='latex' /></p>
<p>or exactly the gradient that was being used above. The advantage of doing things this way is that it is a normal optimization.  There is a well-defined objective. It can be plugged into a standard optimization routine, such as BFGS, which will probably be faster than gradient ascent.  Line searches guarantee convergence. <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7BA%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{A}' title='&#92;tilde{A}' class='latex' /> is perfectly tractable to compute. In fact, if you have already computed approximate marginals, <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7BA%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;tilde{A}' title='&#92;tilde{A}' class='latex' /> has almost no cost. Life is good.</p>
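<p>As a concrete illustration, here is a sketch using naive mean field on two binary variables. The model, the features <code>[x1, x2, x1*x2]</code>, the data moments, and all function names are assumptions made for the example. It computes the approximate marginals together with the mean-field approximate log-partition function, then checks by finite differences that the gradient of the surrogate likelihood is the data moments minus the approximate marginals.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def entropy(m):
    """Entropy of a Bernoulli(m) variable."""
    return -m * math.log(m) - (1.0 - m) * math.log(1.0 - m)

def mean_field(theta, iters=200):
    """Naive mean field for p(x) proportional to exp(a*x1 + b*x2 + w*x1*x2).

    Returns the approximate marginals [m1, m2, m1*m2] and the approximate
    log-partition function (the mean-field lower bound on A).
    """
    a, b, w = theta
    m1, m2 = 0.5, 0.5
    for _ in range(iters):  # coordinate ascent on the bound
        m1 = sigmoid(a + w * m2)
        m2 = sigmoid(b + w * m1)
    mu_tilde = [m1, m2, m1 * m2]
    A_tilde = a * m1 + b * m2 + w * m1 * m2 + entropy(m1) + entropy(m2)
    return mu_tilde, A_tilde

def surrogate_likelihood(theta, data_moments):
    """L_tilde(theta) = theta . (data moments) - A_tilde(theta)."""
    _, A_tilde = mean_field(theta)
    return sum(t * d for t, d in zip(theta, data_moments)) - A_tilde

def numeric_gradient(fun, theta, h=1e-5):
    """Central finite-difference gradient of a scalar function."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grad.append((fun(up) - fun(down)) / (2 * h))
    return grad

theta = [0.2, -0.3, 0.5]          # arbitrary parameters, weak coupling
data_moments = [0.6, 0.5, 0.4]    # hypothetical empirical moments
mu_tilde, A_tilde = mean_field(theta)
analytic = [d - m for d, m in zip(data_moments, mu_tilde)]
numeric = numeric_gradient(lambda t: surrogate_likelihood(t, data_moments), theta)
print(analytic)
print(numeric)  # should agree up to finite-difference error
```

<p>The coupling here is weak enough that mean field has a single fixed point; with stronger couplings the approximate log-partition function can depend on initialization, which is the caveat about local optima.</p>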
<p>The only counterargument I can think of is that mean-field and loopy BP can have different local optima, which <em>might</em> mean that a no-line-search-refuse-to-look-at-the-objective-function-just-follow-the-gradient-and-pray style optimization could be more robust, though I&#8217;d like to see that argument made&#8230;</p>
<p>I&#8217;m not sure of the history, but I think part of the reason this procedure has such a bad reputation (even from people that use it!) might be that it predates the &#8220;modern&#8221; understanding of inference procedures as producing approximate partition functions as well as approximate marginals.</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/11/01/personal-opinions-about-graphical-models-1-the-surrogate-likelihood-exists-and-you-should-use-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Functions</title>
		<link>http://justindomke.wordpress.com/2011/10/19/functions/</link>
		<comments>http://justindomke.wordpress.com/2011/10/19/functions/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 04:13:37 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=1056</guid>
		<description><![CDATA[I was pleased the other day to have cause to discover that this is valid matlab syntax: make_energy_fun = @(x,f) @(y,w) energy(y,x,f,w); My thoughts: This is great!  I&#8217;m creating an anonymous function which takes two parameters and returns an anonymous function taking two parameters which evaluates the energy with the original two parameters baked in. [...]]]></description>
			<content:encoded><![CDATA[<p>I was pleased the other day to have cause to discover that this is valid matlab syntax:</p>
<p><code>make_energy_fun = @(x,f) @(y,w) energy(y,x,f,w);</code></p>
<p>My thoughts:</p>
<ul>
<li>This is great!  I&#8217;m creating an anonymous function which takes two parameters and returns an anonymous function taking two parameters which evaluates the energy with the original two parameters baked in. Then the (original) anonymous function is assigned to the variable <code>make_energy_fun</code>.</li>
<li>But&#8230; in a <a href="http://mitpress.mit.edu/sicp/full-text/book/book.html">civilized programming language</a> wouldn&#8217;t this be, literally, an easy problem on a freshman problem set?</li>
<li>&#8230;Why is it that we don&#8217;t use civilized programming languages, again?</li>
</ul>
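<p>For comparison, here is a sketch of the same construction in Python. The <code>energy</code> function is a made-up stand-in (a weighted squared error), since the real one is not shown here; only the closure pattern matters.</p>

```python
def energy(y, x, f, w):
    """Stand-in energy: a weighted squared error (purely illustrative)."""
    return w * (y - f(x)) ** 2

def make_energy_fun(x, f):
    """Bake in x and f, returning a function of (y, w) only."""
    return lambda y, w: energy(y, x, f, w)

energy_fun = make_energy_fun(3.0, lambda v: 2.0 * v)
print(energy_fun(5.0, 0.5))  # same value as energy(5.0, 3.0, lambda v: 2.0 * v, 0.5)
```

<p>As in the Matlab version, the outer call fixes <code>x</code> and <code>f</code>, and the returned function only needs <code>y</code> and <code>w</code>.</p>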
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/10/19/functions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Quotients</title>
		<link>http://justindomke.wordpress.com/2011/07/21/quotients/</link>
		<comments>http://justindomke.wordpress.com/2011/07/21/quotients/#comments</comments>
		<pubDate>Thu, 21 Jul 2011 18:33:37 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[math]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=1025</guid>
		<description><![CDATA[It seems to me that thinking of quotients as a fundamental operator is usually painful and unnecessary when the objects are almost anything other than real (or rational) numbers. Instead it is better to think of a quotient as a combination of the reciprocal and the product. A good example of this is complex numbers. [...]]]></description>
			<content:encoded><![CDATA[<p>It seems to me that thinking of quotients as a fundamental operator is usually painful and unnecessary when the objects are almost anything other than real (or rational) numbers.  Instead it is better to think of a quotient as a combination of the reciprocal and the product.  A good example of this is complex numbers.  Suppose that</p>
<p><img src='http://s0.wp.com/latex.php?latex=z%3Da%2Bbi&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='z=a+bi' title='z=a+bi' class='latex' /><br />
<img src='http://s0.wp.com/latex.php?latex=w%3Dc%2Bdi.&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='w=c+di.' title='w=c+di.' class='latex' /></p>
<p>Then, the usual rule for the quotient is that</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7Bz%2Fw+%3D+%5Cfrac%7Bac%2Bbd%7D%7Bc%5E2%2Bd%5E2%7D+%2B+i%5Cfrac%7Bbc-ad%7D%7Bc%5E2%2Bd%5E2%7D%7D.&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle{z/w = &#92;frac{ac+bd}{c^2+d^2} + i&#92;frac{bc-ad}{c^2+d^2}}.' title='&#92;displaystyle{z/w = &#92;frac{ac+bd}{c^2+d^2} + i&#92;frac{bc-ad}{c^2+d^2}}.' class='latex' /></p>
<p>This qualifies as non-memorizable.  On the other hand, take the reciprocal of <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='w' title='w' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B1%2Fw+%3D+%5Cfrac%7Bc-di%7D%7Bc%5E2%2Bd%5E2%7D%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle{1/w = &#92;frac{c-di}{c^2+d^2}}' title='&#92;displaystyle{1/w = &#92;frac{c-di}{c^2+d^2}}' class='latex' />.</p>
<p>This is simple enough (&#8220;the complex conjugate divided by the squared norm&#8221;), and we recover the rule for the quotient easily enough by multiplying with <img src='http://s0.wp.com/latex.php?latex=z&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='z' title='z' class='latex' />.</p>
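<p>As a quick sanity check, here is a minimal Python sketch of the same idea: build the reciprocal from the conjugate and the squared norm, then multiply.  The function names <tt>reciprocal</tt>, <tt>quotient</tt> and <tt>quotient_formula</tt> are just illustrative, and the test values are arbitrary.</p>

```python
import cmath


def reciprocal(w):
    """Return 1/w as the complex conjugate divided by the squared norm."""
    c, d = w.real, w.imag
    return complex(c, -d) / (c * c + d * d)


def quotient(z, w):
    """Return z/w as a product of z with the reciprocal of w."""
    return z * reciprocal(w)


def quotient_formula(z, w):
    """Return z/w from the explicit component-wise rule."""
    a, b = z.real, z.imag
    c, d = w.real, w.imag
    denom = c * c + d * d
    return complex((a * c + b * d) / denom, (b * c - a * d) / denom)


z = complex(3.0, -2.0)
w = complex(1.0, 4.0)

# All three values should agree up to floating-point rounding.
print(quotient(z, w), quotient_formula(z, w), z / w)
print(cmath.isclose(quotient(z, w), z / w))
```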
<p>The same thing holds true for derivatives.  I&#8217;ve never been able to remember the quotient rule from high school: namely, that if <img src='http://s0.wp.com/latex.php?latex=f%28x%29%3Dg%28x%29%2Fh%28x%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='f(x)=g(x)/h(x)' title='f(x)=g(x)/h(x)' class='latex' />, then</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7Bf%27%28x%29+%3D+%5Cfrac%7Bh%28x%29g%27%28x%29-h%27%28x%29g%28x%29%7D%7Bh%28x%29%5E2%7D%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle{f&#039;(x) = &#92;frac{h(x)g&#039;(x)-h&#039;(x)g(x)}{h(x)^2}}' title='&#92;displaystyle{f&#039;(x) = &#92;frac{h(x)g&#039;(x)-h&#039;(x)g(x)}{h(x)^2}}' class='latex' /></p>
<p>Ick.  Instead, better to note that if <img src='http://s0.wp.com/latex.php?latex=r%28x%29+%3D+1%2Fh%28x%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='r(x) = 1/h(x)' title='r(x) = 1/h(x)' class='latex' /> then</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7Br%27%28x%29+%3D+%5Cfrac%7B-h%27%28x%29%7D%7Bh%28x%29%5E2%7D%7D%2C&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle{r&#039;(x) = &#92;frac{-h&#039;(x)}{h(x)^2}},' title='&#92;displaystyle{r&#039;(x) = &#92;frac{-h&#039;(x)}{h(x)^2}},' class='latex' /></p>
<p>along with the standard rule for differentiating products, so that if <img src='http://s0.wp.com/latex.php?latex=f%28x%29%3Dg%28x%29%2Fh%28x%29%3Dg%28x%29r%28x%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='f(x)=g(x)/h(x)=g(x)r(x)' title='f(x)=g(x)/h(x)=g(x)r(x)' class='latex' />, then</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7Bf%27%28x%29+%3D+g%28x%29r%27%28x%29%2Bg%27%28x%29r%28x%29%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;displaystyle{f&#039;(x) = g(x)r&#039;(x)+g&#039;(x)r(x)}' title='&#92;displaystyle{f&#039;(x) = g(x)r&#039;(x)+g&#039;(x)r(x)}' class='latex' />.</p>
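<p>To see that the two routes agree, here is a small Python sketch.  The particular choices g(x) = sin(x) and h(x) = x^2 + 1 are assumptions made only for illustration; the reciprocal-plus-product form is compared against the quotient rule and against a central finite difference.</p>

```python
import math


def g(x):
    return math.sin(x)


def dg(x):
    return math.cos(x)


def h(x):
    return x * x + 1.0


def dh(x):
    return 2.0 * x


def dr(x):
    """Derivative of the reciprocal r(x) = 1/h(x)."""
    return -dh(x) / h(x) ** 2


def df_product(x):
    """f'(x) from the product rule applied to f(x) = g(x) r(x)."""
    return g(x) * dr(x) + dg(x) / h(x)


def df_quotient(x):
    """f'(x) from the quotient rule."""
    return (h(x) * dg(x) - dh(x) * g(x)) / h(x) ** 2


def df_numeric(x, eps=1e-6):
    """Central finite-difference estimate of f'(x) for f = g/h."""
    return (g(x + eps) / h(x + eps) - g(x - eps) / h(x - eps)) / (2.0 * eps)


x0 = 0.7
print(df_product(x0), df_quotient(x0), df_numeric(x0))
```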
<p>Another case would be the &#8220;matrix quotient&#8221; <img src='http://s0.wp.com/latex.php?latex=B+C%5E%7B-1%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='B C^{-1}' title='B C^{-1}' class='latex' />.  Of course, everyone already thinks of the matrix multiplication and inverse as separate operations&#8211; to do otherwise would be horrible&#8211; but I think that just proves the point&#8230;</p>
<p>(Although, I assume that computing <img src='http://s0.wp.com/latex.php?latex=BC%5E%7B-1%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='BC^{-1}' title='BC^{-1}' class='latex' /> as a single operation would be more numerically stable than first taking an explicit inverse.  This might mean something to people who feel that mathematical notation ought to suggest an obvious stable implementation in IEEE floating point (if any).)</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/07/21/quotients/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Example usage of JGMT</title>
		<link>http://justindomke.wordpress.com/2011/05/23/example-usage-of-jgmt/</link>
		<comments>http://justindomke.wordpress.com/2011/05/23/example-usage-of-jgmt/#comments</comments>
		<pubDate>Mon, 23 May 2011 16:38:55 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=989</guid>
		<description><![CDATA[Here is an example of usage of the graphical models toolbox I&#8217;ve just released. I&#8217;ll use the terminology of &#8220;perturbation&#8221; to refer to computing loss gradients from the difference of two problems as in this paper, and &#8220;truncated fitting&#8221; to refer to backpropagating derivatives through the TRW inference process for a fixed number of iterations, [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&amp;blog=4009146&amp;post=989&amp;subd=justindomke&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Here is an example of usage of the <a href="http://people.rit.edu/jcdicsa/JGMT/">graphical models toolbox</a> I&#8217;ve just released.  I&#8217;ll use the terminology of &#8220;perturbation&#8221; to refer to computing loss gradients from the difference of two problems as in <a href="http://people.rit.edu/jcdicsa/papers/2010nips.pdf">this paper</a>, and &#8220;truncated fitting&#8221; to refer to backpropagating derivatives through the TRW inference process for a fixed number of iterations, as in <a href="http://people.rit.edu/jcdicsa/papers/2011cvpr.pdf">this paper</a>.</p>
<p>This is basically the simplest possible vision problem. We will train a conditional random field (CRF) to take some noisy image <img src='http://s0.wp.com/latex.php?latex=%5Cbf+x&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf x' title='&#92;bf x' class='latex' /> as input:</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/x_1-250000e00.png"><img class="aligncenter size-full wp-image-990" title="x_1.250000e+00" src="http://justindomke.files.wordpress.com/2011/05/x_1-250000e00.png?w=380" alt=""   /></a></p>
<p>and produce marginals that accurately predict a binary image <img src='http://s0.wp.com/latex.php?latex=%5Cbf+y&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf y' title='&#92;bf y' class='latex' /> as output:</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/y_1-250000e00.png"><img class="aligncenter size-full wp-image-991" title="y_1.250000e+00" src="http://justindomke.files.wordpress.com/2011/05/y_1-250000e00.png?w=380" alt=""   /></a></p>
<p>The noisy image <img src='http://s0.wp.com/latex.php?latex=%5Cbf+x&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;bf x' title='&#92;bf x' class='latex' /> is produced by setting <img src='http://s0.wp.com/latex.php?latex=x_i+%3D+y_i%281-t_i%5En%29+%2B+%281-y_i%29t_i%5En&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='x_i = y_i(1-t_i^n) + (1-y_i)t_i^n' title='x_i = y_i(1-t_i^n) + (1-y_i)t_i^n' class='latex' /> where <img src='http://s0.wp.com/latex.php?latex=t_i&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='t_i' title='t_i' class='latex' /> is random on <img src='http://s0.wp.com/latex.php?latex=%5B0%2C1%5D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='[0,1]' title='[0,1]' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='n' title='n' class='latex' /> is a noise level as described in <a href="http://people.rit.edu/jcdicsa/papers/2010nips.pdf">this paper</a>.  For now, we set <img src='http://s0.wp.com/latex.php?latex=n%3D1.25&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='n=1.25' title='n=1.25' class='latex' /> which is a pretty challenging amount of noise, as you see above.</p>
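<p>For concreteness, the noise model can be sketched in a few lines of Python.  This is not the toolbox code: the function name <tt>add_noise</tt> is made up, and drawing each t_i uniformly on [0,1] is an assumption (the text above only says it is random on that interval).</p>

```python
import random


def add_noise(y, n, rng):
    """Return x with x_i = y_i*(1 - t_i**n) + (1 - y_i)*t_i**n.

    y   : flat list of binary labels (0 or 1)
    n   : noise level
    rng : random.Random instance; t_i is drawn uniformly (an assumption)
    """
    x = []
    for y_i in y:
        t = rng.random()
        x.append(y_i * (1.0 - t ** n) + (1.0 - y_i) * t ** n)
    return x


rng = random.Random(0)           # fixed seed so the sketch is repeatable
y = [0, 1, 1, 0, 1, 0, 0, 1]     # a tiny stand-in for a binary image
x = add_noise(y, 1.25, rng)
print(x)
```

<p>Each x_i stays in [0,1]: pixels with y_i = 0 get the value t_i^n and pixels with y_i = 1 get 1 - t_i^n.</p>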
<p>The main file, which does the learning described here, can be found in the toolbox in <tt>examples/train_binary_denoisers.m</tt>.</p>
<p>First off, we will train a model using perturbation, with the univariate likelihood loss function, based on TRW inference, with a convergence threshold of 1e-5.  We do this by typing:</p>
<pre>
&gt;&gt; train_binary_denoisers('pert_ul_trw_1e-5')
                                                        First-order
 Iteration  Func-count       f(x)        Step-size       optimality
     0           1         0.692759                         0.096
     1           3         0.686739       0.432616         0.0225
     2           5         0.682054             10         0.0182
     3           6         0.670785              1         0.0317
     4          10         0.542285        48.8796          0.932
     5          12         0.515509            0.1          0.965
     6          13         0.439039              1          0.355
     7          14         0.302082              1          0.279
     8          15         0.228832              1          0.471
     9          17         0.223659       0.344464         0.0159
    10          18         0.223422              1        0.00417
    11          19         0.223231              1        0.00269
    12          20         0.223227              1        0.00122
    13          22         0.223221        4.33583       0.000857
    14          23         0.223201              1        0.00121
    15          24         0.223138              1        0.00306
    16          25         0.223035              1        0.00509
    17          26         0.222903              1        0.00564
    18          27         0.222824              1         0.0035
    19          28         0.222806              1       0.000945
    20          29         0.222803              1       0.000798
    21          30         0.222802              1        0.00079
    22          31         0.222798              1        0.00111
    23          32         0.222788              1        0.00168
    24          33         0.222763              1        0.00255
    25          34         0.222707              1        0.00364
    26          35         0.222603              1        0.00435
    27          36         0.222479              1        0.00339
    28          37         0.222408              1        0.00117
    29          38         0.222394              1       9.64e-05
    30          39         0.222393              1       2.05e-05
    31          40         0.222393              1       4.06e-06
    32          41         0.222393              1       2.86e-07  
</pre>
<p>The final model trained has an error rate of 0.096.  We can visualize the marginals produced by making an image where each pixel has an intensity proportional to the predicted probability that that pixel takes label &#8220;1&#8221;.</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/b_out_pert_ul_trw_1e-5_1-250000e001.png"><img src="http://justindomke.files.wordpress.com/2011/05/b_out_pert_ul_trw_1e-5_1-250000e001.png?w=380" alt="" title="b_out_pert_ul_trw_1e-5_1.250000e+00"   class="aligncenter size-full wp-image-1002" /></a></p>
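<p>A rough Python sketch of that visualization step follows.  The helper names <tt>to_gray</tt> and <tt>error_rate</tt> are hypothetical, and the definition of error rate used here (the fraction of pixels whose marginal, thresholded at 0.5, disagrees with the true label) is an assumption; the toolbox may compute it differently.</p>

```python
def to_gray(marginals):
    """Map probabilities of label 1 (in [0, 1]) to 8-bit pixel intensities."""
    return [[int(round(255 * p)) for p in row] for row in marginals]


def error_rate(marginals, y):
    """Fraction of pixels whose thresholded marginal differs from the label.

    Thresholding at 0.5 is an assumption made for this sketch.
    """
    wrong = 0
    total = 0
    for row_p, row_y in zip(marginals, y):
        for p, label in zip(row_p, row_y):
            predicted = 1 if p >= 0.5 else 0
            wrong += int(predicted != label)
            total += 1
    return wrong / total


# Tiny made-up example: a 2x2 grid of marginals and true labels.
marginals = [[0.9, 0.2], [0.4, 0.7]]
y = [[1, 0], [1, 1]]
print(to_gray(marginals))
print(error_rate(marginals, y))
```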
<p>On the other hand, we might train a model using truncated fitting, with the univariate likelihood, and 5 iterations of TRW.</p>
<p><tt>&gt;&gt; train_binary_denoisers('trunc_ul_trw_5')</tt></p>
<p>Sparing you the details of the optimization, this yields a total error rate of .0984 and the marginals:</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/b_out_trunc_ul_trw_5_1-250000e001.png"><img src="http://justindomke.files.wordpress.com/2011/05/b_out_trunc_ul_trw_5_1-250000e001.png?w=380" alt="" title="b_out_trunc_ul_trw_5_1.250000e+00"   class="aligncenter size-full wp-image-1004" /></a></p>
<p>Thus, restricting to only 5 iterations pays only a small accuracy penalty compared to a tight convergence threshold.</p>
<p>Or, we could fit using the surrogate conditional likelihood.  (Here called E.M., though we don&#8217;t happen to have any hidden variables.)</p>
<p><tt>&gt;&gt; train_binary_denoisers('em_trw_1e-5')</tt></p>
<p>This yields an error rate of .1016, and the marginals:</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/b_out_em_trw_1e-5_1-250000e00.png"><img src="http://justindomke.files.wordpress.com/2011/05/b_out_em_trw_1e-5_1-250000e00.png?w=380" alt="" title="b_out_em_trw_1e-5_1.250000e+00"   class="aligncenter size-full wp-image-1006" /></a></p>
<p>There are many permutations of loss functions, inference algorithms, etc.  (Some of which have not yet been published.)  Rather than exhaust all the possibilities, I&#8217;ll just list a bunch of examples:</p>
<p><tt>'pert_ul_trw_1e-5'</tt> (Perturbation + univariate likelihood + TRW with 1e-5 threshold)</p>
<p><tt>'trunc_cl_trw_5'</tt> (Truncated Fitting + clique likelihood + TRW with 5 iterations)</p>
<p><tt>'trunc_cl_mnf_5'</tt> (Truncated Fitting + clique likelihood + Mean Field with 5 iterations)</p>
<p><tt>'trunc_em_trw_5'</tt> (Truncated EM, with TRW used to compute both upper and lower bounds on partition function + 5 iterations)</p>
<p><tt>'em_trw_1e-5'</tt> (Regular EM, with TRW used to compute both upper and lower bounds on partition function + 1e-5 threshold)</p>
<p><tt>'em_trw/mnf_1e-5'</tt> (Regular EM, with TRW used for upper bound and meanfield used for lower bound + 1e-5 threshold)</p>
<p><tt>'pseudo'</tt> (Pseudolikelihood)</p>
<p>As for the pseudolikelihood, let&#8217;s try it.</p>
<p><tt>&gt;&gt; train_binary_denoisers('pseudo')</tt></p>
<p>This yields a near-chance error rate of .44, and the marginals (produced by TRW):</p>
<p><a href="http://justindomke.files.wordpress.com/2011/05/b_out_pseudo_1-250000e00.png"><img src="http://justindomke.files.wordpress.com/2011/05/b_out_pseudo_1-250000e00.png?w=380" alt="" title="b_out_pseudo_1.250000e+00"   class="aligncenter size-full wp-image-1011" /></a></p>
<p>Which is why you probably shouldn&#8217;t use the pseudolikelihood&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/05/23/example-usage-of-jgmt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/x_1-250000e00.png" medium="image">
			<media:title type="html">x_1.250000e+00</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/y_1-250000e00.png" medium="image">
			<media:title type="html">y_1.250000e+00</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/b_out_pert_ul_trw_1e-5_1-250000e001.png" medium="image">
			<media:title type="html">b_out_pert_ul_trw_1e-5_1.250000e+00</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/b_out_trunc_ul_trw_5_1-250000e001.png" medium="image">
			<media:title type="html">b_out_trunc_ul_trw_5_1.250000e+00</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/b_out_em_trw_1e-5_1-250000e00.png" medium="image">
			<media:title type="html">b_out_em_trw_1e-5_1.250000e+00</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2011/05/b_out_pseudo_1-250000e00.png" medium="image">
			<media:title type="html">b_out_pseudo_1.250000e+00</media:title>
		</media:content>
	</item>
		<item>
		<title>Graphical Models Toolbox</title>
		<link>http://justindomke.wordpress.com/2011/05/23/graphical-models-toolbox/</link>
		<comments>http://justindomke.wordpress.com/2011/05/23/graphical-models-toolbox/#comments</comments>
		<pubDate>Mon, 23 May 2011 16:38:05 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[graphical models]]></category>
		<category><![CDATA[matlab]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=982</guid>
		<description><![CDATA[I&#8217;m releasing code for a &#8220;toolbox&#8221; of code for learning and inference with graphical models. It is focused on parameter learning using marginalization in the high-treewidth setting. Though the code is, in principle, domain independent, I&#8217;ve developed it with vision problems in mind. This means that the code is A) efficient (all the inference algorithms [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&amp;blog=4009146&amp;post=982&amp;subd=justindomke&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m releasing a &#8220;<a href="http://people.rit.edu/jcdicsa/JGMT/">toolbox</a>&#8221; of code for learning and inference with graphical models.  It is focused on parameter learning using marginalization in the high-treewidth setting.  Though the code is, in principle, domain independent, I&#8217;ve developed it with vision problems in mind.  This means that the code A) is efficient (all the inference algorithms are implemented in C++) and B) can handle arbitrary graph structures.</p>
<p>There are, at present, a bunch of limitations:</p>
<ul>
<li>All the inference algorithms are for <i>marginal</i> inference.  No MAP inference, at all.</li>
<li>The code handles pairwise graphs only.</li>
<li>All variables must have the same number of possible values.</li>
<li>For tree-reweighted belief propagation, a <i>single</i> edge appearance probability must be used for all edges.</li>
</ul>
<p>For vision, these are usually no big deal.  (Except if you are a MAP inference person.  But that is not advisable.)  In other domains, though, these might be showstoppers.</p>
<p>The code can be used in a bunch of different ways, depending on whether you are looking for a specific tool to use or a large framework.</p>
<ul>
<li>Just use the low-level [Inference] algorithms, namely A) Tree-Reweighted Belief propagation + variants (Loopy BP, TRW-S) or B) Mean-field.  Take care of everything else yourself.</li>
<li>Use the [Differentiation] methods (back-TRW or implicit differentiation) to calculate parameter gradients by providing your own loss functions.  Do everything else on your own.</li>
<li>Use the [Loss] methods (em, implicit_loss) to calculate parameter gradients by providing a true vector x and a loss name (univariate likelihood, clique likelihood, etc.).  Unlike the above usages, these methods explicitly consider the conditional learning setting where one has an input and an output.</li>
<li>Use the [CRF] methods to calculate almost everything (deal with parameter ties for a specific type of model, etc.).  These methods consider specific classes of CRFs and, given an input, output, loss function, inference method, etc., give the parameter gradient.  Employing this gradient in a learning framework is quite straightforward.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/05/23/graphical-models-toolbox/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Matrix Calculus</title>
		<link>http://justindomke.wordpress.com/2011/03/24/matrix-calculus/</link>
		<comments>http://justindomke.wordpress.com/2011/03/24/matrix-calculus/#comments</comments>
		<pubDate>Thu, 24 Mar 2011 23:47:21 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[calculus]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[matrix]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=970</guid>
		<description><![CDATA[Based on a lot of requests from students, I did a lecture on matrix calculus in my machine learning class today. This was based on Minka&#8217;s Old and New Matrix Algebra Useful for Statistics and Magnus and Neudecker&#8217;s Matrix Differential Calculus with Applications in Statistics and Econometrics. In making the notes, I used a couple [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&amp;blog=4009146&amp;post=970&amp;subd=justindomke&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Based on a lot of requests from students, I did a <a href="http://people.rit.edu/jcdicsa/courses/SML/01background.pdf">lecture on matrix calculus</a> in my machine learning class today.  This was based on Minka&#8217;s <a href="http://research.microsoft.com/en-us/um/people/minka/papers/matrix/">Old and New Matrix Algebra Useful for Statistics</a> and Magnus and Neudecker&#8217;s Matrix Differential Calculus with Applications in Statistics and Econometrics.</p>
<p>In making the notes, I used a couple of innovations, which I am still debating the wisdom of.  The first is the rule for calculating derivatives of scalar-valued functions of a matrix input <img src='http://s0.wp.com/latex.php?latex=f%28X%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='f(X)' title='f(X)' class='latex' />.  Traditionally, this is written like so:</p>
<p>if <img src='http://s0.wp.com/latex.php?latex=dy+%3D+%5Ctext%7Btr%7D%28A%5ET+dX%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='dy = &#92;text{tr}(A^T dX)' title='dy = &#92;text{tr}(A^T dX)' class='latex' /> then <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7Bdy%7D%7BdX%7D+%3D+A&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;frac{dy}{dX} = A' title='&#92;frac{dy}{dX} = A' class='latex' />.</p>
<p>I initially found the presence of the trace here baffling.  However, there is the simple rule that</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ctext%7Btr%7D%28A%5ET+B%29+%3D+A+%5Ccdot+B&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;text{tr}(A^T B) = A &#92;cdot B' title='&#92;text{tr}(A^T B) = A &#92;cdot B' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=%5Ccdot&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;cdot' title='&#92;cdot' class='latex' /> is the matrix inner product.  This puts the rule in the much more intuitive (to me!) form:</p>
<p>if <img src='http://s0.wp.com/latex.php?latex=dy+%3D+A+%5Ccdot+dX&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='dy = A &#92;cdot dX' title='dy = A &#92;cdot dX' class='latex' /> then <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7Bdy%7D%7BdX%7D+%3D+A&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;frac{dy}{dX} = A' title='&#92;frac{dy}{dX} = A' class='latex' />.</p>
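<p>The identity between the trace and the matrix inner product is easy to check numerically.  Below is a small pure-Python sketch; the helper names and the two example matrices are made up for illustration.</p>

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def inner(A, B):
    """Matrix inner product: the sum of all elementwise products."""
    return sum(a * b
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[-1.0, 0.5, 2.0], [3.0, -2.0, 1.0]]

# tr(A^T B) and A . B should print the same number.
print(trace(matmul(transpose(A), B)), inner(A, B))
```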
<p>This seems more straightforward, but it comes at a cost.  When working with the rule in the trace form, one often needs to do quite a bit of shuffling around of matrices.  This is easy to do using the standard trace identities like</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Ctext%7Btr%7D%28ABC%29%3D%5Ctext%7Btr%7D%28CAB%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;text{tr}(ABC)=&#92;text{tr}(CAB)' title='&#92;text{tr}(ABC)=&#92;text{tr}(CAB)' class='latex' />.</p>
<p>If we are to work with inner-products, we will require a similar set of rules.  It isn&#8217;t too hard to show that there are &#8220;dual&#8221; identities like</p>
<p><img src='http://s0.wp.com/latex.php?latex=A+%5Ccdot+%28BC%29+%3D+B+%5Ccdot+%28AC%5ET%29+%3D+C+%5Ccdot+%28B%5ET+A%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='A &#92;cdot (BC) = B &#92;cdot (AC^T) = C &#92;cdot (B^T A)' title='A &#92;cdot (BC) = B &#92;cdot (AC^T) = C &#92;cdot (B^T A)' class='latex' /></p>
<p>which allow a similar shuffling with dot products.  Still, these are certainly less easy to remember.</p>
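<p>These dual identities can also be checked on a small example.  The sketch below uses made-up integer matrices with compatible shapes (A is 2x3, B is 2x2, C is 2x3) so that the arithmetic is exact; the helper names are illustrative only.</p>

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def inner(A, B):
    """Matrix inner product: the sum of all elementwise products."""
    return sum(a * b
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))


A = [[1, 2, 3], [4, 5, 6]]      # 2 x 3
B = [[1, -1], [2, 0]]           # 2 x 2
C = [[0, 1, 2], [3, -1, 1]]     # 2 x 3

lhs = inner(A, matmul(B, C))                  # A . (BC)
mid = inner(B, matmul(A, transpose(C)))       # B . (A C^T)
rhs = inner(C, matmul(transpose(B), A))       # C . (B^T A)
print(lhs, mid, rhs)
```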
<p>There is also a set of other rules that seem to be needed in practice, but aren&#8217;t included in standard texts.  For example, if <img src='http://s0.wp.com/latex.php?latex=R&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='R' title='R' class='latex' /> is a function that is applied elementwise to a matrix or vector (e.g. <img src='http://s0.wp.com/latex.php?latex=%5Csin&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;sin' title='&#92;sin' class='latex' />), then</p>
<p><img src='http://s0.wp.com/latex.php?latex=d%28R%28F%29%29+%3D+R%27%28F%28X%29%29+%5Codot+dF&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='d(R(F)) = R&#039;(F(X)) &#92;odot dF' title='d(R(F)) = R&#039;(F(X)) &#92;odot dF' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=%5Codot&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;odot' title='&#92;odot' class='latex' /> is the elementwise product.  This then requires other (very simple) identities for getting rid of the elementwise product, such as</p>
<p><img src='http://s0.wp.com/latex.php?latex=%7B%5Cbf+x%7D%5Codot%7B%5Cbf+y%7D+%3D+%5Ctext%7Bdiag%7D%28%7B%5Cbf+x%7D%29+%7B%5Cbf+y%7D+%3D+%5Ctext%7Bdiag%7D%28%7B%5Cbf+y%7D%29+%7B%5Cbf+x%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;bf x}&#92;odot{&#92;bf y} = &#92;text{diag}({&#92;bf x}) {&#92;bf y} = &#92;text{diag}({&#92;bf y}) {&#92;bf x}' title='{&#92;bf x}&#92;odot{&#92;bf y} = &#92;text{diag}({&#92;bf x}) {&#92;bf y} = &#92;text{diag}({&#92;bf y}) {&#92;bf x}' class='latex' />.</p>
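<p>Both the elementwise differential rule and the diag identity can be illustrated with a short Python sketch.  Here R is taken to be sin applied to a vector, and the small perturbation <tt>dF</tt> and all the numbers are arbitrary choices for illustration.</p>

```python
import math


def hadamard(x, y):
    """Elementwise product of two vectors."""
    return [a * b for a, b in zip(x, y)]


def diag(x):
    """Square matrix with the vector x on its diagonal."""
    n = len(x)
    return [[x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(M, v):
    """Matrix-vector product for a list of rows M."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]


# Identity: x (elementwise) y = diag(x) y = diag(y) x
x = [1.0, -2.0, 0.5]
y = [3.0, 4.0, -6.0]
print(hadamard(x, y), matvec(diag(x), y), matvec(diag(y), x))

# Rule: d(sin(F)) is approximately cos(F) (elementwise) dF for small dF
F = [0.3, 1.2, -0.7]
dF = [1e-6, -2e-6, 3e-6]
actual = [math.sin(f + df) - math.sin(f) for f, df in zip(F, dF)]
predicted = hadamard([math.cos(f) for f in F], dF)
print(actual, predicted)
```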
<p>Another issue with using dot products everywhere is the need to constantly convert between transposes and inner-products.  (This issue comes up because I prefer an &#8220;all vectors are column vectors&#8221; convention.)  The never-ending debate over whether we should write</p>
<p><img src='http://s0.wp.com/latex.php?latex=%7B%5Cbf+x%7D+%5Ccdot+%7B%5Cbf+y%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;bf x} &#92;cdot {&#92;bf y}' title='{&#92;bf x} &#92;cdot {&#92;bf y}' class='latex' /></p>
<p>or</p>
<p><img src='http://s0.wp.com/latex.php?latex=%7B%5Cbf+x%7D%5ET+%7B%5Cbf+y%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='{&#92;bf x}^T {&#92;bf y}' title='{&#92;bf x}^T {&#92;bf y}' class='latex' /></p>
<p>seems to have particular importance here, and I&#8217;m not sure of the best choice.</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2011/03/24/matrix-calculus/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Automatic Differentiation Without Compromises</title>
		<link>http://justindomke.wordpress.com/2009/11/30/automatic-differentiation-without-compromises/</link>
		<comments>http://justindomke.wordpress.com/2009/11/30/automatic-differentiation-without-compromises/#comments</comments>
		<pubDate>Mon, 30 Nov 2009 16:39:04 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[calculus]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=902</guid>
		<description><![CDATA[Automatic differentiation is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program. This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives. Some systems like cvx do a great job [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/">Automatic differentiation</a> is a classic numerical method that takes a program, and (with minimal programmer effort) computes the derivatives of that program.  This is very useful because, when optimizing complex functions, a lot of time tends to get spent manually deriving and then writing code for derivatives.  Some systems like <a href="http://www.stanford.edu/~boyd/cvx/">cvx</a> do a great job of recognizing simple problem types (linear programs, cone programs, etc.), but can&#8217;t handle arbitrary functions.</p>
<p>At present, automatic differentiation involves a degree of pain.  Typically, the user needs to write C or C++ or Fortran code and augment the code with &#8220;taping&#8221; commands to get it to work.  There is also usually a significant computational overhead.</p>
<p>In a just world, there would be no pain, and there would be no computational overhead.  In a just world, reverse-mode autodiff would work like this:</p>
<p><strong>STEP 1</strong> The user writes the function they want to optimize in a convenient high level language like Python.</p>
<p>This is easy, but slow, and provides no derivatives.</p>
<p>Consider a function that sums the elements in a matrix product.  The user (apparently unaware that numpy provides matrix multiplication) writes:</p>
<pre>
    from numpy import dot

    def fun(A,B):
        # sum of all entries of the matrix product of A and B
        rez = 0
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                rez += dot(A[i,:],B[:,j])
        return rez
</pre>
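<p>For reference, the double loop computes the same number as summing the entries of numpy&#8217;s matrix product.  A minimal check, assuming only that numpy is installed (<code>fun_vectorized</code> is an illustrative name, not part of any library):</p>

```python
import numpy as np

def fun(A, B):
    # the loop version from above, written against the np namespace
    rez = 0
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            rez += np.dot(A[i, :], B[:, j])
    return rez

def fun_vectorized(A, B):
    # same quantity: the sum of all entries of the matrix product
    return np.dot(A, B).sum()

rng = np.random.RandomState(0)
A = rng.rand(3, 5)
B = rng.rand(5, 4)
same = np.allclose(fun(A, B), fun_vectorized(A, B))
```

<p>The loop version is kept as the running example because it is the kind of naive code an autodiff tool should be able to handle.</p>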
<p><strong>STEP 2</strong> The user feeds their high level function into a library.  This library uses operator overloading magic to build an <a href="http://en.wikipedia.org/wiki/File:Expression_Graph.svg">expression graph</a> representation of the function.  This expression graph is then used to generate efficient machine code for both the function and its derivatives.</p>
<p>The user should merely have to do something like</p>
<pre>    import numpy
    A = numpy.random.rand(3,5)
    B = numpy.random.rand(5,4)
    dfun = compile_fun(fun,A,B)
</pre>
<p>They could then compute the function and derivatives (all in compiled code) using</p>
<pre>    F,[dFdA,dFdB] = dfun(A,B)
</pre>
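<p>For this particular function the expected gradients can be worked out by hand, which gives something to compare <code>dFdA</code> and <code>dFdB</code> against.  Since F is the sum over i and j of the entries of the product, dF/dA[i,k] is the sum of row k of B, and dF/dB[k,j] is the sum of column k of A.  A minimal sketch, assuming numpy; <code>fun_and_grads</code> and <code>numeric_grad</code> are illustrative names and this is not how the library computes anything:</p>

```python
import numpy as np

def fun_and_grads(A, B):
    """Hand-derived value and gradients of F(A, B) = sum of all entries of A.dot(B)."""
    F = np.dot(A, B).sum()
    # dF/dA[i,k] = sum_j B[k,j]: every row of dFdA is the vector of row sums of B
    dFdA = np.tile(B.sum(axis=1), (A.shape[0], 1))
    # dF/dB[k,j] = sum_i A[i,k]: every column of dFdB is the vector of column sums of A
    dFdB = np.tile(A.sum(axis=0)[:, None], (1, B.shape[1]))
    return F, [dFdA, dFdB]

def numeric_grad(f, X, eps=1e-6):
    """Central finite differences, one entry at a time (slow; for checking only)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp = X.copy()
        Xm = X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.RandomState(0)
A = rng.rand(3, 5)
B = rng.rand(5, 4)
F, [dFdA, dFdB] = fun_and_grads(A, B)
ok_A = np.allclose(dFdA, numeric_grad(lambda X: np.dot(X, B).sum(), A), atol=1e-6)
ok_B = np.allclose(dFdB, numeric_grad(lambda X: np.dot(A, X).sum(), B), atol=1e-6)
```

<p>The finite-difference check needs one function evaluation per matrix entry, which is exactly the cost that reverse-mode autodiff avoids.</p>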
<p>Thus, the user gets everything, without having to compromise:  maximum performance, and maximum convenience.</p>
<hr />
<p>Since this is how things <em>should</em> work, I spent some time this past summer writing a library that does.  The above code is actually a working example, calling <code>compile_fun</code>, the only function a consumer of the library needs to know about.  This library comes tantalizingly close to my goals.</p>
<p>Another simple example, mixing matrices and scalars:</p>
<pre>
from numpy import random
import etree

def fun(A,B,a):
    # sum of squared entries of A*a-B (the scalar a comes after the matrix A)
    return sum(sum((A*a-B)**2))

a = 2.
A = random.rand(15,22)
B = random.rand(15,22)
dfun = etree.compile_fun(fun,A,B,a)
F,[dFdA,dFdB,dFda] = dfun(A,B,a)
</pre>
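<p>Again the gradients are easy to derive by hand here.  With R = A*a - B, we have dF/dA = 2aR, dF/dB = -2R and dF/da = 2 sum(R*A), where * is elementwise.  A small sketch for comparison, assuming numpy (<code>fun_and_grads</code> is an illustrative name, not part of etree):</p>

```python
import numpy as np

def fun_and_grads(A, B, a):
    """Hand-derived value and gradients of F = sum((A*a - B)**2)."""
    R = A * a - B                  # residual matrix
    F = (R ** 2).sum()
    dFdA = 2.0 * a * R             # derivative with respect to each entry of A
    dFdB = -2.0 * R                # derivative with respect to each entry of B
    dFda = 2.0 * (R * A).sum()     # derivative with respect to the scalar a
    return F, [dFdA, dFdB, dFda]

rng = np.random.RandomState(0)
A = rng.rand(15, 22)
B = rng.rand(15, 22)
a = 2.0
F, [dFdA, dFdB, dFda] = fun_and_grads(A, B, a)

# central-difference check of the scalar derivative only
eps = 1e-6
F_plus = fun_and_grads(A, B, a + eps)[0]
F_minus = fun_and_grads(A, B, a - eps)[0]
ok_a = np.isclose(dFda, (F_plus - F_minus) / (2 * eps), rtol=1e-5)
```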
<p>Consider the above matrix multiplication example, with 60&#215;60 matrices.  The library can generate the expression graph, transform that into a simple bytecode, then transform that bytecode into C code <em>acceptably</em> quickly: 11.97 seconds on my machine.  The code then runs very fast:  0.0086 seconds, as opposed to 0.0254 seconds for just running the original function or 0.0059 seconds for calling numpy&#8217;s matrix multiplication routine <code>dot(A,B)</code>.  (Which, of course, does not provide derivatives!)</p>
<p>There is, inevitably, one large problem:  horrific compilation times on large functions.  To take the C code and transform it into machine code, gcc takes <strong>1116 seconds</strong>.  Why, you ask?  The reason is that gcc is trying to compile a single <strong>36.5 MB</strong> function:</p>
<pre>
#include "math.h"
void rock(double* A, double* D){
A[12800]=A[0] * A[6400];
A[12801]=A[1] * A[6480];
A[12802]=A[12800] + A[12801];
A[12803]=A[2] * A[6560];
A[12804]=A[12802] + A[12803];
A[12805]=A[3] * A[6640];
// thousands of pages of same
D[12801] = D[12802];
D[1] += D[12801] * A[6480];
D[6480] += D[12801] * A[1];
D[0] += D[12800] * A[6400];
D[6400] += D[12800] * A[0];
}
</pre>
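<p>The structure of that listing is simple: a forward pass that writes every intermediate value into its own slot of <code>A</code>, followed by a reverse pass that accumulates adjoints in <code>D</code> in the opposite order.  The following is a rough sketch of how operator overloading can record such a straight-line program.  The <code>Tape</code> and <code>Var</code> classes are hypothetical and handle only scalar <code>*</code> and <code>+</code>; this is not the code in etree.py.</p>

```python
class Tape:
    """Records scalar operations and emits straight-line C statements."""
    def __init__(self, n_inputs):
        self.n = n_inputs      # next free slot in the value array A
        self.forward = []      # forward statements, in execution order
        self.backward = []     # one group of adjoint statements per operation

    def new_slot(self):
        slot = self.n
        self.n += 1
        return slot

class Var:
    """A value living in slot `slot` of A; arithmetic on it is recorded on the tape."""
    def __init__(self, tape, slot):
        self.tape = tape
        self.slot = slot

    def __mul__(self, other):
        t = self.tape
        out, a, b = t.new_slot(), self.slot, other.slot
        t.forward.append("A[%d]=A[%d] * A[%d];" % (out, a, b))
        # product rule: each input receives the output adjoint times the other input
        t.backward.append(["D[%d] += D[%d] * A[%d];" % (a, out, b),
                           "D[%d] += D[%d] * A[%d];" % (b, out, a)])
        return Var(t, out)

    def __add__(self, other):
        t = self.tape
        out, a, b = t.new_slot(), self.slot, other.slot
        t.forward.append("A[%d]=A[%d] + A[%d];" % (out, a, b))
        # a sum passes the output adjoint through to both inputs
        t.backward.append(["D[%d] += D[%d];" % (a, out),
                           "D[%d] += D[%d];" % (b, out)])
        return Var(t, out)

def emit(tape):
    """Forward statements first, then the adjoint groups in reverse order."""
    lines = list(tape.forward)
    for group in reversed(tape.backward):
        lines.extend(group)
    return lines

tape = Tape(n_inputs=4)
x = [Var(tape, i) for i in range(4)]
y = x[0] * x[2] + x[1] * x[3]     # a two-term dot product
code = emit(tape)
```

<p>Here <code>code</code> holds three forward statements and six adjoint statements.  The caller is assumed to zero <code>D</code> and set the adjoint of the output slot to 1 before running them.  Every scalar operation becomes its own statement, which is why the generated function grows so large.</p>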
<p>Though this is very simple code, it seems that gcc still uses some super-linear algorithms, so it can&#8217;t handle large functions.</p>
<p>(To be clear, for small or medium sized problems, the code compiles very quickly.)</p>
<p>Now, clearly, I am doing something wrong.  Frankly, I am writing this largely in the hope that someone who actually knows something about compilers and programming languages can enlighten me.  The options I have considered so far:</p>
<p>1) Use a better compiler.  I&#8217;ve tried this.  I can turn off gcc&#8217;s optimization, which decreases compilation times, though to a still very slow level.  Alternatively, I could use a fast compiler.  <a href="http://bellard.org/tcc/">tcc</a> flies right through the code.  Unfortunately, the machine code that tcc generates is very slow&#8211; I think due to poor register allocation.</p>
<p>2) Don&#8217;t generate intermediate C code&#8211; directly generate machine code or assembly.  This would certainly work, but there is no way I am up to it.  Maybe it would make sense to target the <a href="http://en.wikipedia.org/wiki/Java_Virtual_Machine">JVM</a>?</p>
<p>3) Use one of the newfangled just-in-time virtual machine things, like <a href="http://llvm.org/">LLVM</a> or <a href="http://en.wikipedia.org/wiki/LibJIT">libJIT</a>.  This seems promising, but after quite a bit of research I&#8217;m still really unsure about how well this is likely to work, or how to proceed.</p>
<p>4) Write a fast interpreter, and just pass the bytecode to this.  I&#8217;ve tried this as well.  This is basically the strategy used by some autodiff packages, e.g. <a href="http://www.coin-or.org/CppAD/">cppad</a>.  This can get reasonable performance, but will never compete with native code.  The reason is&#8211; I think&#8211; that the interpreter is basically a giant switch statement, and all the branching screws up the pipeline on the CPU.</p>
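<p>To make option 4 concrete, here is a toy version of such an interpreter: a loop that dispatches on an opcode for every instruction.  The opcodes and the tuple layout are made up for illustration and are not the bytecode format used by etree or cppad.</p>

```python
MUL, ADD = 0, 1    # hypothetical opcodes

def run(bytecode, A):
    """Execute (op, out, a, b) instructions in order on the value array A."""
    for op, out, a, b in bytecode:
        # this per-instruction dispatch plays the role of the giant switch statement
        if op == MUL:
            A[out] = A[a] * A[b]
        elif op == ADD:
            A[out] = A[a] + A[b]
        else:
            raise ValueError("unknown opcode %r" % (op,))
    return A

# A[0]*A[2] + A[1]*A[3], stored in slot 6
bytecode = [(MUL, 4, 0, 2), (MUL, 5, 1, 3), (ADD, 6, 4, 5)]
A = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0]
run(bytecode, A)
```

<p>A C interpreter has the same shape, with the if/elif chain replaced by a switch, and pays for one dispatch per scalar operation.</p>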
<p>Anyway, the code is available as <a href="http://people.rit.edu/jcdicsa/code/etree/etree.py">etree.py</a>, and a tiny file <a href="http://people.rit.edu/jcdicsa/code/etree/mycode.h">mycode.h</a>.  (Let&#8217;s say it&#8217;s available under the <a href="http://www.opensource.org/licenses/mit-license.php">MIT license</a>, and the condition that you not laugh too much at my programming skills.)  To try it out, just start python and do</p>
<pre>
import etree
etree.runtests()
</pre>
<p>At this point, the library <em>is</em> useful (I use it!) but only for relatively small problems, containing no more than, say, a few thousand variables.</p>
<p>If anyone has ideas for better compiler flags, or knows about the feasibility of targeting LLVM or libJIT instead of C code, please get in touch.</p>
<p>Disclaimer:  There are a bunch of minor issues with the library.  These are all fixable, but it isn&#8217;t really worth the effort unless the compilation times can be dealt with.</p>
<ul>
<li>The function cannot branch.  (No if statements).</li>
<li>If you want to multiply a matrix by a scalar, the scalar must come after the matrix.</li>
<li>I know it works on 32-bit Ubuntu with everything installed via the Synaptic package manager, and on 64-bit OS X, simply using <a href="http://www.sagemath.org/">Sage</a> (with <a href="http://groups.google.com/group/sage-support/msg/21d17df3acf3406d">this patch</a> to make scipy/weave work).  If you want to play around with this, but aren&#8217;t a Python programmer, I would suggest using Sage.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/11/30/automatic-differentiation-without-compromises/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Notation is evil</title>
		<link>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/</link>
		<comments>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 02:00:40 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=839</guid>
		<description><![CDATA[Exhibit A: We have the symbol \propto, with the interpretation x \propto y \leftrightarrow x = c y for some number c, but there doesn&#8217;t appear to exist a symbol (here, I use a boxed question mark, \boxed{?}, to denote the symbol I claim doesn&#8217;t exist) with the interpretation x \boxed{?} y \leftrightarrow x = y + c for some number c. This pains me.  People sometimes have to resort to writing something like [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Exhibit A:</strong></p>
<p>We have the symbol <img src='http://s0.wp.com/latex.php?latex=%5Cpropto&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;propto' title='&#92;propto' class='latex' />, with the interpretation</p>
<p><img src='http://s0.wp.com/latex.php?latex=x+%5Cpropto+y+%5Cleftrightarrow+x+%3D+c+y&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='x &#92;propto y &#92;leftrightarrow x = c y' title='x &#92;propto y &#92;leftrightarrow x = c y' class='latex' /> for some number <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='c' title='c' class='latex' />,</p>
<p>but there doesn&#8217;t appear to exist a symbol (here, I use a boxed question mark, <img src='http://s0.wp.com/latex.php?latex=%5Cboxed%7B%3F%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;boxed{?}' title='&#92;boxed{?}' class='latex' />, to denote the symbol I claim doesn&#8217;t exist) with the interpretation</p>
<p><img src='http://s0.wp.com/latex.php?latex=x+%5Cboxed%7B%3F%7D+y+%5Cleftrightarrow+x+%3D+y+%2B+c&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='x &#92;boxed{?} y &#92;leftrightarrow x = y + c' title='x &#92;boxed{?} y &#92;leftrightarrow x = y + c' class='latex' /> for some number <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='c' title='c' class='latex' />.</p>
<p>This pains me.  People sometimes have to resort to writing something like</p>
<p>&#8220;<img src='http://s0.wp.com/latex.php?latex=y+%3D+f%28x%29%2B%5Ctext%7Bconst%7D+%281%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='y = f(x)+&#92;text{const} (1)' title='y = f(x)+&#92;text{const} (1)' class='latex' /></p>
<p><img src='http://s0.wp.com/latex.php?latex=%3D+g%28x%29+%2B+%5Ctext%7Bconst%7D+%282%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='= g(x) + &#92;text{const} (2)' title='= g(x) + &#92;text{const} (2)' class='latex' /></p>
<p>where the constants are (in general) different on lines (1) and (2).&#8221;</p>
<p>Even worse (or maybe not?), sometimes people seem to leave exponents lying around when they otherwise wouldn&#8217;t, e.g. write</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cexp%28y%29+%5Cpropto+%5Cexp%28f%28x%29%29+%5Cpropto+%5Cexp%28g%28x%29%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;exp(y) &#92;propto &#92;exp(f(x)) &#92;propto &#92;exp(g(x))' title='&#92;exp(y) &#92;propto &#92;exp(f(x)) &#92;propto &#92;exp(g(x))' class='latex' />.</p>
<p><strong>Exhibit B:</strong></p>
<p>We have no symbol meaning &#8220;normalized sum&#8221;.  How many thousands of times have you seen some variant of</p>
<p>&#8220;<img src='http://s0.wp.com/latex.php?latex=y+%3D+%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bx+%5Cin+X%7D+f%28x%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='y = &#92;frac{1}{N}&#92;sum_{x &#92;in X} f(x)' title='y = &#92;frac{1}{N}&#92;sum_{x &#92;in X} f(x)' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=N%3D%7CX%7C&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='N=|X|' title='N=|X|' class='latex' />&#8221;?</p>
<p>Why do we need to define <img src='http://s0.wp.com/latex.php?latex=N&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='N' title='N' class='latex' />?  Can&#8217;t we just use another mystery symbol to write</p>
<p><img src='http://s0.wp.com/latex.php?latex=y+%3D+%5Cboxed%7B%3F%7D_%7Bx+%5Cin+X%7D+f%28x%29&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='y = &#92;boxed{?}_{x &#92;in X} f(x)' title='y = &#92;boxed{?}_{x &#92;in X} f(x)' class='latex' />?</p>
<p>In some situations you could use <img src='http://s0.wp.com/latex.php?latex=%5Ctext%7Bmean%7D&amp;bg=ffffff&amp;fg=333333&amp;s=0' alt='&#92;text{mean}' title='&#92;text{mean}' class='latex' />, but that doesn&#8217;t always really work and is rarely done.</p>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Fitting an inference algorithm instead of a model</title>
		<link>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/</link>
		<comments>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 22:29:23 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[probability]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=741</guid>
		<description><![CDATA[One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm. Basically, forget about the distribution that the CRF represents, and instead care only about how accurate the results that pop out of inference are. An extreme example of this is [...]]]></description>
			<content:encoded><![CDATA[<p>One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm.  Basically, forget about the <em>distribution</em> that the CRF represents, and instead care only about how accurate the <em>results</em> that pop out of inference are.  An extreme example of this is the recent paper <a href="http://stat.fsu.edu/~abarbu/papers/barbu_denoise_cvpr09.pdf">Learning Real-Time MRF Inference for Image Denoising</a> by <a href="http://stat.fsu.edu/~abarbu/">Adrian Barbu</a>.</p>
<p>The basic idea is to fit a FoE (<a href="http://www.gris.informatik.tu-darmstadt.de/~sroth/research/foe/index.html">Field of Experts</a>) image prior such that when one takes only a few gradient descent steps on a denoising posterior, the results are accurate.  From the abstract:</p>
<blockquote><p>We argue that through appropriate training, a MRF/CRF model can be trained to perform very well on a suboptimal inference algorithm. The model is trained together with a fast inference algorithm through an optimization of a loss function [...] We apply the proposed method to an image denoising application [...] with a 1-4 iteration gradient descent inference algorithm. [...] the proposed training approach obtains an improved  benchmark performance as well as a 1000-3000 times speedup compared to the Fields of Experts MRF.</p></blockquote>
<p>The implausible-sounding 1000-fold speedup comes simply from using only 4 iterations of gradient descent rather than several thousand.  (Incidentally, L-BFGS requires far fewer iterations for this problem.)  The results are a bit better than those of the generative FoE model, which takes much more work for training and inference.</p>
<p>I have every confidence that this does work well, and similar strategies could probably be used to create fast inference models/algorithms for many different problems.  My <a href="http://www.cs.umd.edu/~domke/papers/thesis.pdf">thesis</a> was largely an attempt to do the same thing for marginal, rather than MAP inference.</p>
<p>The disturbing/embarrassing question, for me, is:  does this really have anything to do with probabilistic modeling any more?  Formally speaking, a probability density is being fit, but I doubt it would transfer to, say, inpainting, or that samples from the density would look like natural images.  The best interpretation of what is happening might be that one is simply fitting a big, nonlinear, black box function approximation.</p>
<p>It seems that the more effort we expend to wring the best performance out of a probabilistic model, the less &#8220;probabilistic&#8221; it is.</p>
<h6>P.S. Some of my friends have invited me to never mention autodiff again, ever, but this is one of the many papers where I think the learning optimization would be made much easier/faster by using it.</h6>
]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
	</channel>
</rss>
