DerivativeCheck fails with minFunc

2023-03-18 22:23 问答作者：

I'm trying to train a single layer of an autoencoder using minFunc, and while the cost function appears to decrease, when enabled, the DerivativeCheck fails. The code I'm using is as close to textbook values as possible, though extremely simplified.

The loss function I'm using is the squared-error:

$ J(W; x) = \frac{1}{2}||a^{l} - x||^2 $

with $a^{l}$ equal to $\sigma(W^{T}x)$, where $\sigma$ is the sigmoid function. The gradient should therefore be:

$ \delta = (a^{l} - x)*a^{l}(1 - a^{l}) $

$ \nabla_{W} = \delta(a^{l-1})^T $

Note, that to simplify things, I've left off the bias altogether. While this will cause poor performance, it shouldn't affect the gradient check, as I'm only looking at the weight matrix. Additionally, I've tied the encoder and decoder matrices, so there is effectively a single weight matrix.

The code I'm using for the loss function is (edit: I've vectorized the loop I had and cleaned code up a little):

% loss function passed to minFunc
function [ loss, grad ] = calcLoss(theta, X, nHidden)
  [nInstances, nVars] = size(X);

  % we get the variables a single vector, so need to roll it into a weight matrix
  W = reshape(theta(1:nVars*nHidden), nVars, nHidden);
  Wp = W; % tied weight matrix

  % encode each example (nInstances)
  hidden = sigmoid(X*W);

  % decode each sample (nInstances)
  output = sigmoid(hidden*Wp);

  % loss function: sum(-0.5.*(x - output).^2)
  % derivative of loss: -(x - output)*f'(o)
  % if f is sigmoid, then f'(o) = output.*(1-output)
  diff = X - output;
  error = -diff .* output .* (1 - output);
  dW = hidden*error';

  loss = 0.5*sum(diff(:).^2, 2) ./ nInstances;

  % need to unroll gradient matrix back into a single vector
  grad = dW(:) ./ nInstances;
end

Below is the code I use to run开发者_Python百科 the optimizer (for a single time, as the runtime is fairly long with all training samples):

examples = 5000;
fprintf('loading data..\n');
images = readMNIST('train-images-idx3-ubyte', examples) / 255.0;

data = images(:, :, 1:examples);

% each row is a different training sample
X = reshape(data, examples, 784);

% initialize weight matrix with random values
% W: (R^{784} -> R^{10}), W': (R^{10} -> R^{784})
numHidden = 10; % NOTE: this is extremely small to speed up DerivativeCheck
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
W = low + (high-low)*rand(numVisible, numHidden);

% run optimization
options = {};
options.Display = 'iter';
options.GradObj = 'on';
options.MaxIter = 10;
mfopts.MaxFunEvals = ceil(options.MaxIter * 2.5);
options.DerivativeCheck = 'on';
options.Method = 'lbfgs';    
[ x, f, exitFlag, output] = minFunc(@calcLoss, W(:), options, X, numHidden);

The results I get with the DerivitiveCheck on are generally less than 0, but greater than 0.1. I've tried similar code using batch gradient descent, and get slightly better results (some are < 0.0001, but certainly not all).

I'm not sure if I made either a mistake with my math or code. Any help would be greatly appreciated!

update

I discovered a small typo in my code (which doesn't appear in the code below) causing exceptionally bad performance. Unfortunately, I'm still getting getting less-than-good results. For example, comparison between the two gradients:

calculate     check
0.0379        0.0383
0.0413        0.0409
0.0339        0.0342
0.0281        0.0282
0.0322        0.0320

with differences of up to 0.04, which I'm assuming is still failing.

Okay, I think I might have solved the problem. Generally the differences in the gradients are < 1e-4, though I do have at least one which is 6e-4. Does anyone know if this is still acceptable?

To get this result, I rewrote the code and without tying the weight matrices (I'm not sure if doing so will always cause the derivative check to fail). I've also included biases, as they didn't complicate things too badly.

Something else I realized when debugging is that it's really easy to make a mistake in the code. For example, it took me a while to catch:

grad_W1 = error_h*X';

instead of:

grad_W1  = X*error_h';

While the difference between these two lines is just the transpose of grad_W1, because of the requirement of packing/unpacking the parameters into a single vector, there's no way for Matlab to complain about grad_W1 being the wrong dimensions.

I've also included my own derivative check which gives slightly different answers than minFunc's (my deriviate check gives differences that are all below 1e-4).

fwdprop.m:

function [ hidden, output ] = fwdprop(W1, bias1, W2, bias2, X)
  hidden = sigmoid(bsxfun(@plus, W1'*X, bias1));
  output = sigmoid(bsxfun(@plus, W2'*hidden, bias2));
 end

calcLoss.m:

function [ loss, grad ] = calcLoss(theta, X, nHidden)
  [nVars, nInstances] = size(X);
  [W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
  [hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
  err = output - X;
  delta_o = err .* output .* (1.0 - output);
  delta_h = W2*delta_o .* hidden .* (1.0 - hidden);

  grad_W1 = X*delta_h';
  grad_bias1 = sum(delta_h, 2);
  grad_W2 = hidden*delta_o';
  grad_bias2 = sum(delta_o, 2);

  loss = 0.5*sum(err(:).^2);
  grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
end

unpackParams.m:

function [ W1, bias1, W2, bias2 ] = unpackParams(params, nVisible, nHidden)
  mSize = nVisible*nHidden;

  W1 = reshape(params(1:mSize), nVisible, nHidden);
  offset = mSize;    

  bias1 = params(offset+1:offset+nHidden);
  offset = offset + nHidden;

  W2 = reshape(params(offset+1:offset+mSize), nHidden, nVisible);
  offset = offset + mSize;

  bias2 = params(offset+1:end);
end

packParams.m

function [ params ] = packParams(W1, bias1, W2, bias2)
  params = [W1(:); bias1; W2(:); bias2(:)];
end

checkDeriv.m:

function [check] = checkDeriv(X, theta, nHidden, epsilon)
  [nVars, nInstances] = size(X);

  [W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
  [hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
  err = output - X;
  delta_o = err .* output .* (1.0 - output);
  delta_h = W2*delta_o .* hidden .* (1.0 - hidden);

  grad_W1 = X*delta_h';
  grad_bias1 = sum(delta_h, 2);
  grad_W2 = hidden*delta_o';
  grad_bias2 = sum(delta_o, 2);

  check = zeros(size(theta, 1), 2);
  grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
  for i = 1:size(theta, 1)
      Jplus = calcHalfDeriv(X, theta(:), i, nHidden, epsilon);
      Jminus = calcHalfDeriv(X, theta(:), i, nHidden, -epsilon);

      calcGrad = (Jplus - Jminus)/(2*epsilon);
      check(i, :) = [calcGrad grad(i)];
  end
end

checkHalfDeriv.m:

function [ loss ] = calcHalfDeriv(X, theta, i, nHidden, epsilon)
  theta(i) = theta(i) + epsilon;

  [nVisible, nInstances] = size(X);
  [W1, bias1, W2, bias2] = unpackParams(theta, nVisible, nHidden);
  [hidden, output] = fwdprop(W1, bias1, W2, bias2, X);

  err = output - X;
  loss = 0.5*sum(err(:).^2);
end

Update

Okay, I've also figured out why tying the weights was causing issues. I wanted to go down to just [ W1; bias1; bias2 ] since W2 = W1'. This way I could simply recreate W2 by looking at W1. However, because the values of $\theta$ are changed by epsilon, this was in effect changing both matrices at the same time. The proper solution is to simply pass W1 as a separate parameter while at the same time reducing $\theta$.

Update 2

Okay, this is what I get for posting too late at night. While the first update does indeed cause things to pass correctly, it's not the correct solution.

I think the correct thing to do is to actually calculate the gradients for W1 and W2, and then set the final gradient of W1 to grad_W1 to grad_W2. The hand-waving argument is that since the weight matrix is acting to both encode and decode, its weights must be affected by both gradients. I haven't thought through the actual theoretical ramifications of this yet, however.

If I run this using my own derivative check, it passes the 10e-4 threshold. It does much better than before with minFunc's derivative check, though still worse than if I don't tie the weights.

继续阅读：machine-learning mathematical-optimization

DerivativeCheck fails with minFunc

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？