A Cleaner Gaussian Distribution

August 27, 2008

The problem with the Gaussian distribution is that the normalization constant is too complicated.

p(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{x^2}{2\sigma^2})

I admit it really isn’t particularly complicated, but in its many forms– multivariate, conditional, CDF, etc. these things continue to cause annoyance.  In particular, I am frequently finding that I introduce bugs when I write code using Gaussians.

Now, can this be simplified?  It can.  Notice that

\int_{x=-\infty}^\infty a^{-\frac{x^2}{\sigma^2}}=\frac{\sigma\sqrt{\pi}}{ \sqrt{\log(a)} }

So, choosing a=e^\pi, and defining \text{axp}(x)=a^x, we can instead write a Gaussian in the form

p(x)=\frac{1}{\sigma} \text{axp}(-\frac{x^2}{\sigma^2}).

By changing \sigma, this represents any normal Gaussian.

Now, that’s slightly nicer than a regular gaussian, but can it extend to higher dimensions?  (I admit I have to look up the normalization constant for a multivariate Gaussian every time I use one.)  Unfortunately, it doesn’t seem so.  The trouble is that (here x is now a vector)

\int_{x=-\infty}^\infty \exp(-\frac{1}{2}x^T \Sigma^{-1} x)=(2\pi)^{d/2} |\Sigma|^{1/2}

where d is the number of dimensions (Matrix Cookbook).  This means that if we are again going to define the constant a to try to make the normalization constant disappear, a would have to depend on the dimensions of the problem.  That seems odd.


Quantum Mechanics

August 9, 2008

This “diavlog” is fascinating.  They discuss the meaning of quantum mechanics, and how that relates to the issue of what a “measurement” is, and how that relates to what “consciousness” is.

I like this site’s trick of speeding up the video by a factor of 1.4, with clever signal processing to avoid making the participants sound like chipmunks.  It is strange that we can effortlessly listen to people speaking faster than they naturally speak.  (On first watching these videos, one doesn’t even notice that it has been sped up– the participants just seem to have had way too much coffee.)


Marginal Beliefs of Random MRFs

July 26, 2008

A pairwise Markov Random Field is a way of defining a probability distribution over some vector {\bf x}. One way to write one is

p({\bf x}) \propto \exp( \sum_i \phi(x_i) + \sum_{(i,j)} \psi(x_i,x_j) ).

Where the first sum is over all the variables, and the second sum is over neighboring pairs. Here, I generated some random distributions over binary valued variables. For each i, I set \phi(x_i=0)=0, and \phi(x_i=1)=r_i where r_i is some value randomly chosen from a standard Gaussian. For the pairwise terms, I used \psi(x_i,x_j) = .75 \cdot I(x_i=x_j). (i.e. \psi(x_i,x_j) is .75 when the arguments are the same, and zero otherwise.) This is an “attractive network”, where neighboring variables want to have the same value.

Computing marginals p(x_i) is hard in graphs that are not treelike. Here, I approximate them using a nonlinear minimization of a “free energy” similar to that used in loopy belief propagation.

Here, I show the random single-variate biases r_i and the resulting beliefs.  What we see is constant valued regions (encouraged by \psi) interrupted where the $\phi$ is very strong.

Now, with more variables.

Now, a “repellent” network. I repeated the procedure above, but changed the pairwise interactions to \psi(x_i,x_j) = -.75 \cdot I(x_i\not=x_j). Neighboring variables want to have different values.  Notice this is the opposite of the above behavior– regions of “checkerboard” interrupted where the $\phi$ outvotes \psi.

Now, the repellent network with more variables.


More Vector Calculus Identities

July 24, 2008

I realize that “rule 2″ from the previous post is actually just a special case of the vector chain rule.

Rule 4. (Chain rule)

If f({\bf x}) = {\bf g}({\bf h}({\bf x})), then

J[{\bf f}] = J[{\bf g}]J[{\bf h}], or equivalently,

\frac{ \partial{\bf f} }{ \partial{\bf x}^T} = \frac{ \partial{\bf g} }{ \partial{\bf h}^T} \frac{ \partial{\bf h} }{ \partial{\bf x}^T}.

Here, I have used {\bf h} to denote the argument of {\bf g}. (That makes it look more like the usual chain rule.)

From this, you get the special case where g is a scalar function. (I use the non-boldface g in g({\bf h}) to suggest that g is a scalar function that operates ‘element-wise’ on vector input.)

Rule 4. (Chain rule– special case for a scalar function)

If f({\bf x}) = g({\bf h}({\bf x})), then

J[{\bf f}]({\bf x}) = \text{diag}[ g'({\bf h})] J[{\bf h}], or equivalently,

J[{\bf f}]({\bf x}) = g'({\bf h}) {\bf 1}^T \odot J[{\bf h}].

In the last line, I use the fact that \text{diag}({\bf x})A = {\bf x}{\bf 1}^T \odot A.

Finally, substituting g = g' = \exp gives the special case below.


Vector Calculus Identities

July 22, 2008

Suppose

g({\bf x}) = {\bf p}({\bf x})^T {\bf q}({\bf x}).

That is, g is a scalar function of a vector {\bf x}, given by taking the inner product of the two vector valued functions {\bf p} and {\bf q}. Now, we would like the gradient of g, i.e. \frac{\partial g}{\partial \bf x}. What is it?

I frequently need to find such derivatives, and I have never been able to find any reference for rules to calculate them. (Though such rules surely exist somewhere!) Today, Alap and I derived a couple simple rules. The first answers the above question.

Rule 1.

If g({\bf x}) = {\bf p}({\bf x})^T {\bf q}({\bf x}), then,

\frac{\partial g}{\partial \bf x} = J[{\bf p}]^T {\bf q}({\bf x}) + J[{\bf q}]^T{\bf p}({\bf x}).

Here, J[{\bf p}]=\frac{\partial {\bf p}}{\partial {\bf x}^T} is the Jacobian. i.e. J_{i,j} = \frac{\partial p_i }{\partial x_j}

This rule is a generalization of the calculus 101 product rule, where g(x)=p(x) q(x) (everything scalar), and g'(x) = p'(x) q(x) + q'(x) p(x).

A different rule concerning exponentials is

Rule 2.

If {\bf f}({\bf x}) = \exp({\bf q}({\bf x})), then

J[{\bf f}]= \frac{\partial {\bf f}}{\partial {\bf x}^T} = \exp( {\bf q}({\bf x})) {\bf 1}^T \odot J[{\bf q}].

Here, \odot is the element-wise product. The strange product with {\bf 1}^T can be understood as “copying” the first vector. That is, {\bf x} {\bf 1}^T is a matrix where each column consists of \bf x. (The number of columns must be understood from context.)

Rule 3.

If {\bf f}({\bf x}) = {\bf p}({\bf x}) \odot {\bf q}({\bf x}), then

J[{\bf f}] ={\bf p}({\bf x}){\bf 1}^T \odot J[{\bf q}] + J[{\bf p}] \odot {\bf q}({\bf x}){\bf 1}^T

Surely there exists a large set of rules like this somewhere, but I have not been able to find them, despite needing things like this for years now. What I would really like to do is use a computer algebra system, such as Maple, Mathematica or Sage to do this, but that doesn’t seem possible at present. (It is suggested in that thread that Mathematica is up to the job, but I have tested it out found that not to be true.)


Random Image Segmentation

June 20, 2008

I’ve been playing around today with an image decomposition method. Given some observed image \bf y, one seeks an image \bf x so as to minimize

\sum_i (x_i - y_i)^2 + \lambda \sum_{(i,j)} |x_i-x_j|

Where the first sum is over all pixels i, and the second sum is over all neighboring pairs of pixels (i,j) (For simplicity here, I imagine that images are just long vectors, and the grid structure is only present in what pairs are considered neighbors). This minimization has the interesting property that, at a solution, neighboring pixels in \bf x often have the same value. Thus, one can see the optimization as segmenting the image into piecewise constant regions.

I tried using the (excellent) CVX toolbox to do the optimization. Though this worked, it was quite slow. I then derived a little iteratively reweighted least-squares algorithm that is much faster.

Anyway, the standard use of the decomposition is for “denoising”. Here is an example from an image in the Street Scenes database. At top is the original image \bf y, followed by the result \bf x. At the bottom, I use a random colors for each segment to give a better idea of the discontinuities.

Always, intensities are in the 0-1 interval, and only immediate horizontal and vertical pairs are considered neighbors.

\lambda=1

\lambda=1

However, more interesting (to me) images come from segmenting random images.

Uniform noise, \lambda=0.1

Uniform noise, \lambda=0.2

Uniform noise, \lambda=0.3

Uniform noise, \lambda=0.4

Uniform noise, \lambda=0.5

Uniform noise, \lambda=0.6

Gaussian noise, variance \sigma^2=.1, \lambda=0.3

Gaussian noise, variance \sigma^2=.1, \lambda=0.4

Gaussian noise, variance \sigma^2=.1, \lambda=0.5


Markov’s Inequality

June 19, 2008

While looking at Tao’s post on the law of large numbers, I claimed to Alap that Markov’s inequality was “obvious”. Asked to back that up, though, I couldn’t remember why it was obvious. Take some distribution p(x). For simplicity, assume this distribution is over the non-negative reals. Then, Markov’s inequality states, for \lambda > 0

P[x\geq\lambda] \leq \frac{E[x]}{\lambda}.

Though I find it more intuitive in the form

\lambda P[x \geq \lambda] \leq E[x].

Now, a proof is pretty trivial, basically by writing down the definitions of both sides.

E[x] = \int_{x\geq 0} x p(x) dx

\lambda P[x \geq \lambda] = \int_{x\geq\lambda} \lambda p(x) dx \leq \int_{x\geq\lambda} x p(x) dx

\leq \int_{x\geq 0} x p(x) dx

However, I think more intuition can be conveyed with pictures.

Matlab Code:

function markov_inequality

fsize = 16;
lambda = 5;

x = 0:.01:15;
p = .5*exp(-(x-3).^2) + exp(-(x-7).^2/4);
p=p/sum(p)*100;

figure(1)
plot(x,p,'k-','LineWidth',2)
xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
legend('p(x)');
set(gca,'FontSize',fsize)

figure(2)
plot(x,p.*x,'k-','LineWidth',2);
xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
legend('x p(x)');
set(gca,'FontSize',fsize)

figure(3)
plot(x,p.*x,'k-',x,p.*lambda,'r-','LineWidth',2);
xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
hold on;
legend('x p(x)', '\lambda p(x)');
title('\lambda=5');
set(gca,'FontSize',fsize)

figure(4)
area(x,x.*p,'FaceColor',[.7 .7 .7],'LineWidth',2)
hold on;
good = find(x>=lambda);
area(x(good),p(good)*lambda,'FaceColor',[.8 0 0],'LineWidth',2)
hold off
legend('E[x]','\lambda P[x \geq \lambda]')
xlim([min(x) max(x)])
ylim([0 max(x.*p)+.25])
set(gca,'FontSize',fsize)

end

More Implicit Functions

June 19, 2008

(This time, including derivatives.) Closed form solutions are possible for some of these, I know.






…which of course has a constant derivative.

Matlab code:

function implicit_funcs
options = optimset('TolX',1e-15,'MaxFunEvals',1e5,'MaxIter',1e5);
fsize = 16;

e = 1e-5;

f1 = @(x) 5*x;
f2 = @(x) x;

xAR = -3:.05:3;
for n=1:length(xAR)
    x = xAR(n);
    myfun = @(y) 5*abs(y-f1(x))^2 + abs(y-f2(x))^2;
    y_star(n) = fminsearch(myfun,0,options);

    % approximate derivative with one sided finite diff
    x = xAR(n)+e;
    myfun = @(y) 5*abs(y-f1(x))^2 + abs(y-f2(x))^2;
    y_star2(n) = fminsearch(myfun,y_star(n),options);
end
figure(1)
plot(xAR,f1(xAR),'k-',xAR,f2(xAR),'k--',xAR,y_star,'r-','LineWidth',2)
legend('f1','f2','y*')
title('y*(x) = arg min_y = |y-f1(x)|^{2}+ (y-f2(x))^2','FontSize',fsize)
xlabel('x','FontSize',fsize)
ylabel('y','FontSize',fsize)
set(gca,'FontSize',fsize)

figure(2), plot(xAR,(y_star2-y_star)/e,'r-','LineWidth',2);
title('y*(x) = arg min_y = |y-f1(x)|^{2}+ (y-f2(x))^2','FontSize',fsize)
legend('dy*(x)/dx')
set(gca,'FontSize',fsize)
xlabel('x','FontSize',fsize)

end

Implicit Functions

June 19, 2008


Matlab code:

function implicit_funcs

fsize = 16;

f1 = @(x) x;
f2 = @(x) 1.5*x;

xAR = -2.5:.05:2.5;
for n=1:length(xAR)
    x = xAR(n);
    myfun = @(y) abs(y-f1(x)) + (y-f2(x))^2;
    y_star(n) = fminsearch(myfun,0);
end
plot(xAR,f1(xAR),'k-',xAR,f2(xAR),'k--',xAR,y_star,'r.','LineWidth',2)
legend('f1','f2','y*')
title('y*(x) = arg min_y = |y-f1(x)|+ (y-f2(x))^2','FontSize',fsize)
xlabel('x','FontSize',fsize)
ylabel('y','FontSize',fsize)
set(gca,'FontSize',fsize)

clear
figure
fsize = 16;

I=3;
f{1} = @(x) x;
f{2} = @(x) 1.5*x+1;
f{3} = @(x) .5*x+1;
function o=myfun2(y)
    o = 0;
    for i=1:I
        o = o + abs(y-f{i}(x));
    end
end
xAR = -5:.1:5;
for n=1:length(xAR)
    x = xAR(n);
    y_star(n) = fminsearch(@myfun2,0);
end
plot(xAR,f{1}(xAR),'k-',xAR,f{2}(xAR),'k-',xAR,f{3}(xAR),'k-',xAR,y_star,'r.','LineWidth',2)
legend('f1','f2','f3','y*')
title('y*(x) = arg min_y = \Sigma_i |y-f_i(x)|','FontSize',fsize)
xlabel('x','FontSize',fsize)
ylabel('y','FontSize',fsize)
set(gca,'FontSize',fsize)

clear
figure
fsize = 16;

I=3;
f{1} = @(x) x;
f{2} = @(x) 1.5*x+1;
f{3} = @(x) .5*x+1;
function o=myfun3(y)
    o = 0;
    for i=1:I
        o = o + (y-f{i}(x))^2;
    end
end
xAR = -5:.1:5;
for n=1:length(xAR)
    x = xAR(n);
    y_star(n) = fminsearch(@myfun3,0);
end
plot(xAR,f{1}(xAR),'k-',xAR,f{2}(xAR),'k-',xAR,f{3}(xAR),'k-',xAR,y_star,'r.','LineWidth',2)
legend('f1','f2','f3','y*')
title('y*(x) = arg min_y = \Sigma_i (y-f_i(x))^2','FontSize',fsize)
xlabel('x','FontSize',fsize)
ylabel('y','FontSize',fsize)
set(gca,'FontSize',fsize)

end