# Truncated Bi-Level Optimization

In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”.  I vaguely remembered telling the reviewers I would release some code, so I’m finally getting around to it.

The idea of bilevel optimization is quite simple.  Imagine that you would like to minimize some function $L(w)$.  However, $L$ itself is defined through some optimization.  More formally, suppose we would like to solve

$\min_{w,y} Q(y), \text{ s.t. } y = \arg\min_y E(y,w).$

Or, equivalently,

$\min_{w} L(w)=Q(y^*(w)),$

where $y^*$ is defined as $y^*(w) := \arg\min_y E(y,w)$.  This seems a little bit obscure at first, but actually comes up in several different ways in machine learning and related fields.

## Hyper-parameter learning

The first example is learning hyperparameters, such as regularization constants.  Inevitably in machine learning, one fits parameters to optimize some tradeoff between the quality of a fit to training data and a regularization function being small.  Traditionally, the regularization constant is selected by optimizing on a training dataset with a variety of values, and then picking the one that performs best on a held-out dataset.  However, if there are a large number of regularization parameters, a high-dimensional grid-search will not be practical.  In the notation above, suppose that $w$ is a vector of regularization constants, and that $y$ are training parameters.  Let $E$ be the regularized empirical risk on a training dataset, and let $Q$ be how well the parameters $y^*(w)$ perform on some held-out validation dataset.
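To make this concrete, here is a minimal Python sketch of the hyperparameter setup (the data, names, and grid here are all made up for illustration): the inner problem is ridge regression, which conveniently has a closed-form solution, and the outer objective is validation error.

```python
import numpy as np

# Made-up data for illustration: a linear model plus noise.
rng = np.random.default_rng(0)
Xtr = rng.normal(size=(40, 5))
Xva = rng.normal(size=(160, 5))
coef = rng.normal(size=5)
Ytr = Xtr @ coef + 0.1 * rng.normal(size=40)
Yva = Xva @ coef + 0.1 * rng.normal(size=160)

def y_star(a):
    """Inner problem: argmin_y ||Xtr y - Ytr||^2 / n + a ||y||^2 (closed form)."""
    n, d = Xtr.shape
    return np.linalg.solve(Xtr.T @ Xtr / n + a * np.eye(d), Xtr.T @ Ytr / n)

def L(a):
    """Outer objective Q: validation risk of the regularized solution."""
    r = Xva @ y_star(a) - Yva
    return r @ r / len(Yva)

# The traditional approach: a one-dimensional grid search over the constant.
grid = np.exp(np.linspace(-5, 2, 20))
best = min(grid, key=L)
```

With a single constant this grid search is fine; the bilevel machinery below is about replacing it with gradient descent on $L$, which scales to many constants.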

## Energy-based models

Another example (and the one suggesting the notation) is an energy-based model.  Suppose that we have some “energy” function $E_x(y,w)$ which measures how well an output $y$ fits an input $x$.  The energy is parametrized by $w$.  For a given training input/output pair $(\hat{x},\hat{y})$, we might have that $Q_{\hat{y}}(y^*(w))$ measures how well the predicted output $y^*$ compares to the true output $\hat{y}$, where $y^*(w)=\arg\min_y E_{\hat{x}}(y,w)$.
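As a toy illustration (all numbers here are made up), take a scalar energy whose minimizer is available in closed form:

```python
# A minimal scalar instance: the energy E_x(y, w) = 0.5*(y - w*x)^2
# is minimized at y*(w) = w*x, and Q compares that prediction to the truth.
def y_star(w, x):
    return w * x                      # argmin_y E_x(y, w), closed form here

def Q(y, y_true):
    return (y - y_true) ** 2          # outer loss on the predicted output

x_hat, y_hat = 2.0, 6.0               # one (input, output) training pair
L = lambda w: Q(y_star(w, x_hat), y_hat)
```

Here $L(w)$ is zero exactly when the energy's minimizer reproduces the true output; in general $y^*(w)$ has no closed form and must itself be found by optimization.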

## Computing the gradient exactly

Even if we have just the modest goal of following the gradient of $L(w)$ to a local minimum, computing that gradient is not so simple.  Clearly, even evaluating $L(w)$ requires solving the “inner” minimization of $E$. It turns out that one can compute $\nabla_w L(w)$ by first solving the inner minimization, and then solving a linear system.

1. Input $w$.
2. Solve $y^* \leftarrow \arg\min_y E(y,w)$.
3. Compute:
   (a) the loss $L(w)= Q(y^*)$,
   (b) the gradient $g=\nabla_y Q(y^*)$,
   (c) the Hessian $H=\frac{\partial^2 E(y^*,w)}{\partial y\partial y^T}$.
4. Solve the linear system $z=H^{-1} g$.
5. Compute the parameter gradient $\nabla_w L(w) = - \frac{\partial^2 E(y^*,w)}{\partial w\partial y^T} z$.
6. Return $L(w)$ and $\nabla_w L(w)$.
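The steps above can be sketched in Python for a toy quadratic inner problem where every quantity has a closed form (the matrices and vectors here are made up; note that for this particular $E$, the cross-derivative $\partial^2 E/\partial w\partial y^T$ is simply $-I$):

```python
import numpy as np

# Toy quadratic inner problem:
#   E(y, w) = 0.5 y'Ay - y'w   =>   grad_y E = Ay - w,  y*(w) = A^{-1} w
#   Q(y)    = 0.5 ||y - t||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
t = np.array([1.0, -1.0])

def L_and_grad(w):
    y_star = np.linalg.solve(A, w)         # step 2: solve the inner problem
    loss = 0.5 * np.sum((y_star - t) ** 2) # step 3a: the loss Q(y*)
    g = y_star - t                         # step 3b: gradient of Q
    H = A                                  # step 3c: d^2 E / dy dy'
    z = np.linalg.solve(H, g)              # step 4: linear system z = H^{-1} g
    grad = -(-np.eye(2)) @ z               # step 5: here d^2 E / dw dy' = -I
    return loss, grad

w = np.array([0.5, 2.0])
loss, grad = L_and_grad(w)

# sanity check against finite differences of L
eps = 1e-6
fd = np.array([(L_and_grad(w + eps * e)[0] - L_and_grad(w - eps * e)[0]) / (2 * eps)
               for e in np.eye(2)])
```

For a non-quadratic $E$, the inner solve in step 2 becomes an iterative optimization, which is exactly the expense discussed next.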

This looks a bit nasty, since we need to compute second-derivative matrices of $E$.  In fact, as long as one has a routine to compute $\nabla_y E$ and $\nabla_w E$, this can be avoided through efficient Hessian-vector products.  This is essentially the approach proposed by Do, Foo, and Ng in “Efficient multiple hyperparameter learning for log-linear models”.

Overall, this is a decent approach, but it can be quite slow, simply because one must solve an “inner” optimization in order to compute each gradient of the “outer” optimization.  Often, the inner optimization needs to be solved to very high accuracy in order to estimate a gradient accurately enough to reduce $L(w)$, higher accuracy than is needed when one is simply using the predicted value $y^*$ itself.

## Truncated optimization

To get around this expense, a fairly obvious idea is to re-define the problem.  The slowness of computing the exact gradient stems from needing to solve the inner optimization exactly.  So perhaps we can re-define the problem such that an inexact solve of the inner problem nevertheless yields an “exact” gradient?

Re-define the problem as solving

$\min_{w} L(w)=Q(y^*(w)), \text{ } y^*(w) := \text{opt-alg}_y E(y,w)$,

where $\text{opt-alg}$ denotes an approximate solve of the inner optimization.  In order for this to work, $\text{opt-alg}$ must be defined in such a way that $y^*(w)$ is a continuous function of $w$.  With standard optimization methods such as gradient descent or BFGS, this can be achieved by assuming there are a fixed number of iterations applied, with a fixed step-size.  Since each iteration of these algorithms is continuous, this clearly defines $y^*(w)$ as a continuous function.  Thus, in principle, it could be optimized efficiently through automatic differentiation of the code that optimizes $E$.  That’s fine in principle, but often inconvenient in practice.

It turns out, however, that one can derive “backpropagating” versions of algorithms like gradient descent, which take as input only a procedure to compute $E$ along with its first derivatives.  These algorithms can then produce the gradient of $L$ in the same time as automatic differentiation.

If the inner-optimization is gradient descent for $N$ steps with a step-size of $\lambda$, the algorithm to compute the loss is simple:

1. Input $w$; initialize $y_0$.
2. For $k=0,1,\dots,N-1$:
   (a) $y_{k+1} \leftarrow y_k - \lambda \nabla_y E(y_k,w)$
3. Return $L(w) = Q(y_N)$.

How to compute the gradient of this quantity?  The following algorithm does the trick.

1. $\overleftarrow{y_N} \leftarrow \nabla Q(y_N)$
2. $\overleftarrow{w} \leftarrow 0$
3. For $k=N-1,\dots,0$:
   (a) $\overleftarrow{w} \leftarrow \overleftarrow{w} - \lambda \frac{\partial^2 E(y_k,w)}{\partial w \partial y^T} \overleftarrow{y_{k+1}}$
   (b) $\overleftarrow{y_k} \leftarrow \overleftarrow{y_{k+1}} - \lambda \frac{\partial^2 E(y_k,w)}{\partial y \partial y^T} \overleftarrow{y_{k+1}}$
4. Return $\nabla L = \overleftarrow{w}$.
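Here is a Python sketch of the forward and reverse sweeps for a toy quadratic energy $E(y,w)=\frac{1}{2}y^T A y - y^T w$ (the numbers are made up; with a quadratic $E$ the second-derivative matrices are constant, which keeps the sketch short). The reverse sweep can be checked against finite differences of the truncated loss:

```python
import numpy as np

# Toy quadratic energy: grad_y E = Ay - w, d^2E/dydy' = A, d^2E/dwdy' = -I.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, -1.0])
lam, N = 0.1, 20                           # step size and truncation length

def truncated_loss(w):
    y = np.zeros(2)                        # y_0
    for _ in range(N):                     # forward: N gradient-descent steps
        y = y - lam * (A @ y - w)
    return 0.5 * np.sum((y - t) ** 2)      # L(w) = Q(y_N)

def truncated_grad(w):
    y = np.zeros(2)
    for _ in range(N):
        y = y - lam * (A @ y - w)
    ybar = y - t                           # backward pass: grad Q(y_N)
    wbar = np.zeros(2)
    for _ in range(N):                     # reverse sweep, k = N-1, ..., 0
        wbar = wbar - lam * (-np.eye(2)) @ ybar   # uses d^2E/dwdy' = -I
        ybar = ybar - lam * A @ ybar              # uses d^2E/dydy' = A
    return wbar

w = np.array([0.5, 2.0])
grad = truncated_grad(w)

# check against finite differences of the truncated objective
eps = 1e-6
fd = np.array([(truncated_loss(w + eps * e) - truncated_loss(w - eps * e)) / (2 * eps)
               for e in np.eye(2)])
```

The point is that the gradient is exact for the *truncated* objective, no matter how few iterations $N$ are used.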

Similar algorithms can be derived for the heavy-ball algorithm (with a little more complexity) and limited memory BFGS (with a lot more complexity).

## Code

So, finally, here is the code, and I’ll give a simple example of how to use it. There are just four simple files:

I think the meanings of these are pretty straightforward, so I’ll just quickly step through the demo here.  I’ll start off by taking one of Matlab’s built-in datasets (on cities), so that we are trying to predict a measure of crime from measures of climate, housing, health, transportation, education, arts, recreation, and economy, as well as a constant.  There are 329 data points in total, which I split into a training set of size 50, a validation set of size 150, and a test set of size 129.

```matlab
load cities ratings
X = ratings;
for i=1:size(X,2)
    X(:,i) = X(:,i) - mean(X(:,i));
    X(:,i) = X(:,i) / std( X(:,i));
end

% predict crime from climate, housing, health, trans, edu, arts, rec, econ
Y = X(:,4);
X = [X(:,[1:3 5:9]) 1+0*X(:,1)];

p = randperm(length(Y));
X = X(p,:);
Y = Y(p);

whotrain = 1:50;
whoval   = 51:200;
whotest  = 201:329;

Xtrain = X(whotrain,:);
Xval   = X(whoval  ,:);
Xtest  = X(whotest ,:);
Ytrain = Y(whotrain);
Yval   = Y(whoval  );
Ytest  = Y(whotest );
```

Next, I’ll set up some simple constants that will be used later on, and define the optimization parameters for minFunc, that I will be using for the outer optimization. In particular, here I choose the inner optimization to use 20 iterations.

```matlab
opt_iters = 20;

ndims     = size(Xtrain,2);
w0        = zeros(ndims,1);
ndata     = size(Xtrain,1);
ndata_val = size(Xval,1);

options         = [];
options.Method  = 'gd';
options.LS      = 1;
options.MaxIter = 100;
```

Now, I’ll define the training risk function ($E$ in the notation above). This computes the risk with a regularization constant of $a$ (passed in as its logarithm), as well as derivatives. I’ll also define the validation risk ($Q$ in the notation above).

```matlab
function [R dRdw dRdloga] = training_risk(w,loga)
    a       = exp(loga);
    R       = sum( (Xtrain*w - Ytrain).^2 )/ndata + a*sum(w.^2);
    dRdw    = 2*Xtrain'*(Xtrain*w-Ytrain)  /ndata + 2*a*w;
    dRda    = sum(w.^2);
    dRdloga = dRda*a;
end
```

```matlab
function [R g] = validation_risk(w)
    R = sum( (Xval*w - Yval).^2 ) / ndata_val;
    g = 2*Xval'*(Xval*w-Yval)     / ndata_val;
end
```

Now, before going any further, let’s do a traditional sweep through regularization constants to see what that looks like.

```matlab
LAMBDA = -5:.25:2;
VAL_RISK = 0*LAMBDA;
for i=1:length(LAMBDA)
    VAL_RISK(i) = back_lbfgs(@training_risk,@validation_risk,w0,LAMBDA(i),opt_iters);
end
```

Plotting, we get the following:

This is a reasonable looking curve. Instead, let’s ask the algorithm to find the constant by gradient descent.

```matlab
eval = @(loga) back_lbfgs(@training_risk,@validation_risk,w0,loga,opt_iters);
loga = 0;
[loga fval] = minFunc(eval,loga,options);
```

Running the optimization, we see:

```
Iteration   FunEvals     Step Length    Function Val        Opt Cond
        1          2     1.00000e+00     8.74176e-01     3.71997e-03
        2          3     1.00000e+00     8.73910e-01     1.86453e-04
        3          4     1.00000e+00     8.73909e-01     1.06619e-05
        4          5     1.00000e+00     8.73909e-01     3.60499e-08
```

This leads to a regularizer of the form:

$0.860543 ||w||_2^2$.

We can plot this on the graph, and see it matches the result of cross-validation.

If we actually compute the test-set error, this is 0.708217.

OK, let’s be a little bit more adventurous, and use a third-order regularizer. This is done like so:

```matlab
function [R dRdw dRdloga] = training_risk2(w,loga)
    a       = exp(loga(1));
    b       = exp(loga(2));
    R       = sum( (Xtrain*w - Ytrain).^2 ) / ndata + a*sum(abs(w).^2) + b*sum(abs(w).^3);
    % note the factors of 2 and 3 from differentiating |w|^2 and |w|^3
    dRdw    = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata + 2*a*w            + 3*b*abs(w).^2.*sign(w);
    dRda    = sum(abs(w).^2);
    dRdb    = sum(abs(w).^3);
    dRdloga = [dRda*a; dRdb*b];
end
```

Running the optimization, we see:

```
Iteration   FunEvals     Step Length    Function Val        Opt Cond
        1          2     1.00000e+00     8.74445e-01     1.17262e-02
        2          3     1.00000e+00     8.73685e-01     3.21956e-03
        3          4     1.00000e+00     8.73608e-01     1.41744e-03
        4          5     1.00000e+00     8.73598e-01     8.20040e-04
        5          6     1.00000e+00     8.73567e-01     1.39830e-03
        6          7     1.00000e+00     8.73513e-01     2.52994e-03
        7          8     1.00000e+00     8.73471e-01     1.77157e-03
...
       23         28     1.00000e+00     8.70741e-01     8.42628e-06
```

With a final regularizer of the form

$0.000160 \|w\|_2^2 + 5.117057 \|w\|_3^3$

and a hardly-improved test error of 0.679155.

Finally, let’s fit a regularizer with four terms, with powers of $|w|$ ranging from two to five.

```matlab
function [R dRdw dRdloga] = training_risk3(w,loga)
    a    = exp(loga);
    R    = sum( (Xtrain*w - Ytrain).^2 ) / ndata;
    dRdw = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata;
    for ii=1:length(a)
        b = ii+1;
        R    = R    + a(ii)*sum(abs(w).^b);
        dRdw = dRdw + b*a(ii)*abs(w).^(b-1).*sign(w);  % factor of b from differentiating |w|^b
        dRda(ii,1) = sum(abs(w).^b);
    end
    dRdloga = dRda.*a;
end
```

```matlab
eval = @(loga) back_lbfgs(@training_risk3,@validation_risk,w0,loga,opt_iters);
loga = [0; 0; 0; 0];
loga = minFunc(eval,loga,options);
```

Running the optimization, we see

```
Iteration   FunEvals     Step Length    Function Val        Opt Cond
        1          2     1.00000e+00     8.73982e-01     1.02060e-02
        2          3     1.00000e+00     8.73524e-01     3.30445e-03
        3          4     1.00000e+00     8.73447e-01     1.74779e-03
        4          5     1.00000e+00     8.73435e-01     1.43655e-03
        5          6     1.00000e+00     8.73345e-01     4.98340e-03
        6          7     1.00000e+00     8.73295e-01     1.85535e-03
        7          8     1.00000e+00     8.73231e-01     1.81136e-03
...
       38         41     1.00000e+00     8.69758e-01     3.44848e-06
```

Yielding the (strange) regularizer

$0.000000 ||w||_2^2 + 0.000010 ||w||_3^3 + 24.310631 ||w||_4^4 + 0.325565 ||w||_5^5$

and a final test-error of 0.67187.

# Julia, Matlab, and C

Julia is a new language in the same arena as Matlab or R. I’ve had failed attempts to quit the Matlab addiction in the past, making me generally quite conservative about new platforms. However, I’ve recently been particularly annoyed by Matlab’s slow speed, evil license manager errors, restrictions on parallel processes, C++ .mex file pain, etc., and so I decided to check it out. It seems inevitable that Matlab will eventually be displaced by something. The question is: is that something Julia?

The Julia developers state their goals as follows:

> “We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.”

Essentially, the goal seems to be a faster, freer Matlab that treats users like adults (macros!) and doesn’t require writing any .mex files in C++ or Fortran. Sounds too good to be true? I decided to try it out. My comparisons are to Matlab (R2012a) and C (gcc 4.2, with -O2 and -fno-builtin to prevent compile-time computations), all on a recent MacBook Pro (2.3 GHz Intel Core i7 with 8GB RAM).

Installation was trivial: I just grabbed a pre-compiled binary and started it up.

I deliberately used naive algorithms, since I am just testing raw speed. It should be a fair comparison, as long as the algorithm is constant. Please let me know about any bugs, though.

## First benchmark: Fibonacci

```matlab
% Matlab
function f=fib(n)
    if n <= 2
        f=1.0;
    else
        f=fib(n-1)+fib(n-2);
    end
end
```

```c
// C
double fib(int n){
    if(n<=2)
        return(1.0);
    else
        return(fib(n-2)+fib(n-1));
}
```

```julia
# julia
function fib(n)
    if n <= 2
        1.0
    else
        fib(n-1)+fib(n-2)
    end
end
```


Clarity is basically a tie. Running them for n=30 we get:

```
time in matlab (fib): 14.344231
time in c (fib):       0.005887
time in julia (fib):   0.237832
```

## Second benchmark: Matrix Multiplication

This is a test of the naive O(N^3) matrix multiplication algorithm.

```matlab
% matlab
function C=mmult(A,B,C)
    [M,N] = size(A);
    for i=1:M
        for j=1:M
            for k=1:M
                C(i,j) = C(i,j) + A(i,k)*B(k,j);
            end
        end
    end
end
```

```c
// C
#define M 500
void mmult(double A[M][M], double B[M][M], double C[M][M]){
    int i,j,k;
    for(i=0; i<M; i++)
        for(j=0; j<M; j++){
            C[i][j] = 0;
            for(k=0; k<M; k++)
                C[i][j] += A[i][k]*B[k][j];
        }
}
```

```julia
# julia
function mmult(A,B)
    (M,N) = size(A);
    C = zeros(M,M);
    for i=1:M
        for j=1:M
            for k=1:M
                C[i,j] += A[i,k]*B[k,j];
            end
        end
    end
    C
end
```


Here, I think that Matlab and Julia are a bit clearer, and Julia wins through the wonders of having “+=”. The timing results on 500×500 matrices are:

```
time in matlab (matmult): 1.229571
time in c (matmult):      0.157658
time in julia (matmult):  0.5029549
```

## Third Benchmark: numerical quadrature

Here, we attempt to calculate the integral $\int_{5}^{15} \sin(x)\, dx$ by numerical quadrature, using a simple midpoint rule with computations at $10^7$ points.

```matlab
% matlab
function val = numquad(lb,ub,npoints)
    val = 0.0;
    for x=lb:(ub-lb)/npoints:ub
        val = val + sin(x)/npoints;
    end
end
```

```c
// C
double numquad(double lb, double ub, int npoints){
    double val = 0.0;
    int i;
    for(i=0; i<=npoints; i++){
        double x = lb + (ub-lb)*i/npoints;
        val += sin(x)/npoints;
    }
    return(val);
}
```

```julia
# julia
function numquad(lb,ub,npoints)
    val = 0.0
    for x=lb:(ub-lb)/npoints:ub
        val += sin(x)/npoints
    end
    val
end
```

The timings are:

```
time in matlab (numquad): 0.446151
time in c (numquad):      0.167112
time in julia (numquad):  0.256597
```

## Fourth Benchmark: Belief Propagation

Finally, I decided to try a little algorithm similar to what I actually tend to implement for my research. Roughly speaking, Belief Propagation is a repeated sequence of matrix multiplications, followed by normalization.

```matlab
% matlab
function x=beliefprop(A,x,N)
    for i=1:N
        x = A*x;
        x = x/sum(x);
    end
end
```

```c
// C
void beliefprop(double A[25][25], double x[25], int N){
    int i, n, j;
    double x2[25];
    for(n=0; n<N; n++){
        for(i=0; i<25; i++){
            x2[i]=0;
            for(j=0; j<25; j++)
                x2[i] += A[i][j]*x[j];
        }
        for(i=0; i<25; i++)
            x[i]=x2[i];
        double mysum = 0;
        for(i=0; i<25; i++)
            mysum += x[i];
        for(i=0; i<25; i++)
            x[i] /= mysum;
    }
    return;
}
```

```julia
# julia
function beliefprop(A,x,N)
    for i=1:N
        x = A*x;
        x /= sum(x);
    end
    x
end
```

Here, I think we can agree that Matlab and Julia are clearer. (Please don’t make fun of me for hardcoding the 25 dimensions in C.) Using a matrix package for C would probably improve clarity, but perhaps also slow things down. The results are:

```
time in matlab (beliefprop): 0.627478
time in c (beliefprop):      0.074355
time in julia (beliefprop):  0.376427
```

## Fifth Benchmark: BP in log-space

In practice, Belief Propagation is often implemented in log-space (to help avoid numerical under- and over-flow). To simulate an algorithm like this, I tried changing the propagation to take an exponent before multiplication, and a logarithm before storage.

```matlab
% matlab
function x=beliefprop2(A,x,N)
    for i=1:N
        x = log(A*exp(x));
        x = x - log(sum(exp(x)));
    end
end
```

```c
// C
void beliefprop2(double A[25][25], double x[25], int N){
    int i, n, j;
    double x2[25];
    for(n=0; n<N; n++){
        for(i=0; i<25; i++){
            x2[i]=0;
            for(j=0; j<25; j++)
                x2[i] += A[i][j]*exp(x[j]);
        }
        for(i=0; i<25; i++)
            x[i]=log(x2[i]);
        double mysum = 0;
        for(i=0; i<25; i++)
            mysum += exp(x[i]);
        double mynorm = log(mysum);
        for(i=0; i<25; i++)
            x[i] -= mynorm;
    }
    return;
}
```

```julia
# julia
function beliefprop2(A,x,N)
    for i=1:N
        x = log(A*exp(x));
        x -= log(sum(exp(x)));
    end
    x
end
```

Life is too short to write C code like that when not necessary. But how about the speed, you ask?

```
time in matlab (beliefprop2): 0.662761
time in c (beliefprop2):      0.657620
time in julia (beliefprop2):  0.530220
```
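One caveat about all three versions above: they compute $\log\left(\sum_i \exp(x_i)\right)$ directly, which overflows once the entries of $x$ get large. The standard fix (not used above, to keep the three implementations directly comparable) is the max-shift trick, illustrated here as a Python sketch:

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so no exponential can overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

big = np.array([1000.0, 1000.5])   # naive np.log(np.sum(np.exp(big))) overflows
val = logsumexp(big)               # finite, and equal to the true value
```

The same one-line change applies to the normalization step in any of the three languages.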

## Sixth Benchmark: Markov Chain Monte Carlo

Here, I implement a simple Metropolis algorithm. For no particular reason, I use the two-dimensional distribution:

$p(x) \propto \exp(\sin(5 x_1) - x_1^2 - x_2^2)$

```matlab
% matlab
function mcmc(x,N)
    f = @(x) exp(sin(x(1)*5) - x(1)^2 - x(2)^2);
    p = f(x);
    for n=1:N
        x2 = x + .01*randn(size(x));
        p2 = f(x2);
        if rand < p2/p
            x = x2;
            p = p2;
        end
    end
end
```

```c
// C
double f(double *x){
    return exp(sin(x[0]*5) - x[0]*x[0] - x[1]*x[1]);
}

#define pi 3.141592653589793

void mcmc(double *x, int N){
    double p = f(x);
    int n;
    double x2[2];
    for(n=0; n<N; n++){
        // run Box-Muller to get 2 normal random variables
        double U1 = ((double)rand())/RAND_MAX;
        double U2 = ((double)rand())/RAND_MAX;
        double R1 = sqrt(-2*log(U1))*cos(2*pi*U2);
        double R2 = sqrt(-2*log(U1))*sin(2*pi*U2);
        x2[0] = x[0] + .01*R1;
        x2[1] = x[1] + .01*R2;
        double p2 = f(x2);
        if(((double)rand())/RAND_MAX < p2/p){
            x[0] = x2[0];
            x[1] = x2[1];
            p = p2;
        }
    }
}
```

```julia
# julia
function mcmc(x,N)
    f(x) = exp(sin(x[1]*5) - x[1]^2 - x[2]^2);
    p = f(x);
    for n=1:N
        x2 = x + .01*randn(size(x));
        p2 = f(x2);
        if rand() < p2/p
            x = x2;
            p = p2;
        end
    end
end
```


Again, I think that C is far less clear than Matlab or Julia. The timings are:

```
time in matlab (mcmc): 7.747716
time in c (mcmc):      0.150776
time in julia (mcmc):  0.479628
```

## Table

All times are in seconds. (Lower is better.)

| benchmark | Matlab |     C | Julia |
|-----------|-------:|------:|------:|
| fib       | 14.344 | 0.005 | 0.237 |
| matmult   |  1.229 | 0.157 | 0.502 |
| numquad   |  0.446 | 0.167 | 0.256 |
| bp        |  0.627 | 0.074 | 0.376 |
| bp2       |  0.662 | 0.657 | 0.530 |
| mcmc      |  7.747 | 0.150 | 0.479 |

## Conclusions

I’m sure all these programs can be sped up. In particular, I’d bet that an expert could optimize the C code to beat Julia on bp2 and mcmc. These are a test of “how fast can Justin Domke make these programs”, not the intrinsic capabilities of the languages. That said, Julia allows for optional type declarations. I did experiment with these but found absolutely no speed improvement. (Which is a good or a bad thing, depending on how you look at life.)

Another surprise to me was how often Matlab’s JIT managed a speed within a reasonable factor of C. (Except when it didn’t…)

The main thing that a Matlab programmer will miss in Julia is undoubtedly plotting. The Julia designers seem to understand the importance of this (“non-negotiable”). If Julia equalled Matlab’s plotting facilities, Matlab would be in real trouble!

Overall, I think that the killer features of freedom, kinda-sorta-C-like speed, and ease of use make Julia a more likely Matlab-killer than other projects such as R, Sage, Octave, Scipy, etc. (Not to say that those projects have not succeeded in other ways!) Though Julia’s designers also seem to be targeting current R users, my guess is that they will have more success with Matlab folks in the short term, since most Matlab functionality (other than plotting) already exists, while reproducing R’s statistical libraries will be quite difficult. I also think that Julia would be very attractive to current users of languages like Lush. Just to never write another .mex file, I’ll very seriously consider Julia for new projects. Other benefits, such as macros and better parallelism support, are just bonuses. As Julia continues to develop, it will become yet more attractive.

There was an interesting discussion on Lambda the Ultimate about Julia back when it was announced.

# CRF Toolbox Updated

I updated the code for my Graphical Models / Conditional Random Fields toolbox. This is a Matlab toolbox, though almost all the real work is done in compiled C++ for efficiency. The main improvements are:

- Lots of bugfixes.
- Various small improvements in speed.
- A unified CRF training interface, to make things easier for those not training on images.
- Binaries are now provided for Linux as well as OS X.
- The code for inference and learning using TRW is now multithreaded, using OpenMP.
- Switched to using a newer version of Eigen.
There are also far more detailed examples, including a full tutorial of how to train a CRF to do “semantic segmentation” on the Stanford Backgrounds dataset. Just using simple color, position, and histogram-of-gradient features, the error rate is 23%, which appears to be state of the art (and better than previous CRF-based approaches). It takes about 90 minutes to train on my 8-core machine, and processes new frames in a little over a second each.

For fun, I also ran this model on a video of someone driving from Alexandria into Georgetown. You can see that the results are far from perfect but are reasonably good. (Notice it successfully distinguishes trees and grass at 0:12)

I’m keen to have others use the code (what with the hundreds of hours spent writing it), so please send email if you have any issues.

# Graphical Models Toolbox

I’m releasing a “toolbox” of code for learning and inference with graphical models. It is focused on parameter learning using marginalization in the high-treewidth setting. Though the code is, in principle, domain independent, I’ve developed it with vision problems in mind. This means that the code is (a) efficient (all the inference algorithms are implemented in C++) and (b) able to handle arbitrary graph structures.

There are, at present, a bunch of limitations:

- All the inference algorithms are for marginal inference. No MAP inference, at all.
- The code handles pairwise graphs only.
- All variables must have the same number of possible values.
- For tree-reweighted belief propagation, a single edge appearance probability must be used for all edges.

For vision, these are usually no big deal. (Except if you are a MAP inference person. But that is not advisable.) In other domains, though, these might be showstoppers.

The code can be used in a bunch of different ways, depending on if you are looking for a specific tool to use, or a large framework.

- Just use the low-level [Inference] algorithms, namely (a) tree-reweighted belief propagation and variants (loopy BP, TRW-S) or (b) mean-field. Take care of everything else yourself.
- Use the [Differentiation] methods (back-TRW or implicit differentiation) to calculate parameter gradients by providing your own loss functions. Do everything else on your own.
- Use the [Loss] methods (em, implicit_loss) to calculate parameter gradients by providing a true vector x and a loss name (univariate likelihood, clique likelihood, etc.). Unlike the above usages, these methods explicitly consider the conditional learning setting, where one has an input and an output.
- Use the [CRF] methods to calculate almost everything (dealing with parameter ties for a specific type of model, etc.). These methods consider specific classes of CRFs and, given an input, output, loss function, inference method, etc., give the parameter gradient. Employing this gradient in a learning framework is quite straightforward.