Neural Network Backpropagation?
Can anyone recommend a website or give a brief explanation of how backpropagation is implemented in a NN? I understand the basic concept, but I'm unsure how to go about writing the code.
Many of the sources I've found simply show equations without explaining why they're doing it, and the variable names make them difficult to follow.
Example:
void bpnn_output_error(double *delta, double *target, double *output,
                       int nj, double *err)
{
    int j;
    double o, t, errsum;

    errsum = 0.0;
    for (j = 1; j <= nj; j++) {     /* arrays are 1-indexed in this code */
        o = output[j];              /* actual output of node j */
        t = target[j];              /* target output of node j */
        delta[j] = o * (1.0 - o) * (t - o);
        errsum += ABS(delta[j]);    /* ABS is a macro defined elsewhere */
    }
    *err = errsum;
}
In that example, can someone explain the purpose of
delta[j] = o * (1.0 - o) * (t - o);
Thanks.
The purpose of
delta[j] = o * (1.0 - o) * (t - o);
is to find the error of an output node in a backpropagation network.
o represents the node's actual output, and t is the target (expected) output for the node.
The factor o * (1.0 - o) is the first derivative of the transfer function commonly used here, the sigmoid. (Other transfer functions are not uncommon, and would require rewriting the code to use their first derivative instead; a mismatch between function and derivative would likely mean that training does not converge.) The node has an "activation" value that is fed through the transfer function to obtain the output o, like
o = f(activation)
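For the logistic sigmoid in particular, the derivative can be written purely in terms of the node's output, so the code never needs to keep the raw activation around. A minimal sketch (the function names here are illustrative, not from the original code):

#include <math.h>

/* Logistic sigmoid transfer function: f(x) = 1 / (1 + exp(-x)) */
double sigmoid(double activation)
{
    return 1.0 / (1.0 + exp(-activation));
}

/* Its first derivative, written in terms of the output o = f(activation):
   f'(activation) = o * (1 - o),
   which is exactly the o * (1.0 - o) factor in the question's code. */
double sigmoid_derivative(double o)
{
    return o * (1.0 - o);
}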
The main thing is that backpropagation uses gradient descent, and the error gets backward-propagated by application of the Chain Rule. The problem is one of credit assignment, or blame if you will, for the hidden nodes whose output is not directly comparable to the expected value. We start with what is known and comparable, the output nodes. The error is taken to be proportional to the first derivative of the output times the raw error value between the expected output and actual output.
So more symbolically, we'd write that line as
delta[j] = f'(activation_j) * (t_j - o_j)
where f is your transfer function, and f' is the first derivative of it.
Further back in the hidden layers, the error at a node is its estimated contribution to the errors found at the next layer. So the deltas from the succeeding layer are multiplied by the connecting weights, and those products are summed. That sum is multiplied by the first derivative of the activation of the hidden node to get the delta for a hidden node, or
delta[j] = f'(activation_j) * Sum(delta[k] * w_jk)
where j now references a hidden node and k a node in a succeeding layer.
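To make that concrete, a hidden-layer version in the same style as the quoted bpnn_output_error might look roughly like this. This is only a sketch: the 1-based indexing and the w[j][k] weight layout are assumptions made to match the code in the question.

/* Sketch: compute deltas for a hidden layer.
   delta_h : deltas for the hidden nodes (filled in here)
   hidden  : outputs of the hidden nodes
   delta_o : deltas already computed for the next (output) layer
   w[j][k] : weight from hidden node j to next-layer node k */
void hidden_error(double *delta_h, double *hidden, int nh,
                  double *delta_o, int no, double **w)
{
    int j, k;
    double h, sum;

    for (j = 1; j <= nh; j++) {
        h = hidden[j];
        sum = 0.0;
        for (k = 1; k <= no; k++)         /* blame passed back from the next layer */
            sum += delta_o[k] * w[j][k];
        delta_h[j] = h * (1.0 - h) * sum; /* times the sigmoid derivative at node j */
    }
}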
(t - o) is the error in the output of the network, since t is the target output and o is the actual output. It is stored in a normalized form in the delta array. The method used to normalize depends on the implementation, and the o * (1.0 - o) factor seems to be doing that (I could be wrong about that assumption).
This normalized error is accumulated over the entire training set to judge when training is complete: usually when errsum falls below some target threshold.
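As a rough sketch of how that stopping criterion might be used (train_one_pass and the constants are hypothetical, not from the code in the question):

double train_one_pass(void);    /* hypothetical: one forward/backward sweep over
                                   the training set, returning the summed error */

#define ERROR_THRESHOLD 0.01
#define MAX_EPOCHS      10000

void train(void)
{
    double errsum;
    int epoch;

    for (epoch = 0; epoch < MAX_EPOCHS; epoch++) {
        errsum = train_one_pass();
        if (errsum < ERROR_THRESHOLD)   /* training is considered complete */
            break;
    }
}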
Actually, if you know the theory, the programs should be easy to understand. Read a textbook and work through some small examples with pencil and paper to figure out the exact steps of propagation. This is a general principle for implementing numerical programs: you have to understand the details in small cases first.
If you know Matlab, I'd suggest reading some Matlab source code (e.g. here), which is easier to understand than C.
For the code in your question, the names are fairly self-explanatory: output is probably the array of your predictions, target the array of training labels, and delta the error between the predictions and the true values; delta also serves as the value used to update the weights.
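The usual update step then scales each delta by the output of the node on the other end of the weight and by a learning rate. Roughly, as a sketch with illustrative names, using the same 1-based indexing as the question's code:

/* Sketch: apply the deltas of an upper layer to the weights feeding it.
   w[i][j]  : weight from lower-layer node i to upper-layer node j
   lower[i] : output of lower-layer node i
   delta[j] : delta computed for upper-layer node j
   eta      : learning rate */
void adjust_weights(double **w, double *lower, int nl,
                    double *delta, int nu, double eta)
{
    int i, j;

    for (j = 1; j <= nu; j++)
        for (i = 1; i <= nl; i++)
            w[i][j] += eta * delta[j] * lower[i];
}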
Essentially, what backprop does is run the network on the training data, observe the output, and then adjust the weights, propagating the error from the output nodes back toward the input nodes iteratively.