Introduction to Neural Networks
Backward Propagation
Backward propagation (backprop) is the process of computing how the loss function changes with respect to each parameter in the network. The objective is to update the parameters in the direction that reduces the loss.
To achieve this, we use the gradient descent algorithm: we compute the derivatives of the loss with respect to each layer's pre-activation values (the raw outputs before the activation function is applied) and propagate them backward through the network.
Each layer contributes to the final prediction, so the gradients must be computed in a structured manner:
- Perform forward propagation;
- Compute the derivative of the loss with respect to the output pre-activation;
- Propagate this derivative backward through the layers using the chain rule;
- Compute gradients for weights and biases to update them.
Notation
To make the explanation clearer, let's use the following notation:
- W^l is the weight matrix of layer l;
- b^l is the vector of biases of layer l;
- z^l is the vector of pre-activations of layer l;
- a^l is the vector of activations of layer l.
Therefore, setting a^0 to x (the inputs), forward propagation in a perceptron with n layers can be described as follows:
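Assuming activations are row vectors of shape 1 x n_neurons (matching the shapes discussed below) and writing f^l for the activation function of layer l, the forward pass is:

$$
z^l = a^{l-1} W^l + b^l, \qquad a^l = f^l\left(z^l\right), \qquad l = 1, \dots, n.
$$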
To describe backpropagation mathematically, we introduce the following notations:
- da^l: derivative of the loss with respect to the activations at layer l;
- dz^l: derivative of the loss with respect to the pre-activations at layer l (before applying the activation function);
- dW^l: derivative of the loss with respect to the weights at layer l;
- db^l: derivative of the loss with respect to the biases at layer l.
Computing Gradients for the Output Layer
At the final layer n, we first compute the gradient of the loss with respect to the output layer's activations, da^n. Next, using the chain rule, we compute the gradient of the loss with respect to the output layer's pre-activations:
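Assuming the activation function is applied element-wise, with f'^n denoting its derivative and ⊙ denoting element-wise multiplication, this is:

$$
dz^n = da^n \odot f'^n\left(z^n\right)
$$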
This quantity represents how sensitive the loss function is to changes in the output layer's pre-activation.
Once we have dz^n, we compute gradients for the weights and biases:
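Under the row-vector convention used here, these can be written as:

$$
dW^n = \left(a^{n-1}\right)^T dz^n, \qquad db^n = dz^n,
$$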
where (a^{n-1})^T is the transposed vector of activations from the previous layer. Since the original vector has shape 1 x n_neurons, the transposed vector has shape n_neurons x 1.
To propagate this backward, we calculate the derivative of the loss with respect to the activations of the previous layer:
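With the same row-vector convention:

$$
da^{n-1} = dz^n \left(W^n\right)^T
$$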
Propagating Gradients to the Hidden Layers
For each hidden layer l, the procedure is the same. Given da^l:
- Compute the derivative of the loss with respect to the pre-activations;
- Compute the gradients for the weights and biases;
- Compute da^{l-1} to propagate the derivative backward.
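Under the same conventions, these steps correspond to:

$$
dz^l = da^l \odot f'^l\left(z^l\right), \qquad dW^l = \left(a^{l-1}\right)^T dz^l, \qquad db^l = dz^l, \qquad da^{l-1} = dz^l \left(W^l\right)^T.
$$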
This step repeats until we reach the first layer.
Updating Weights and Biases
Once we have computed the gradients for all layers, we update the weights and biases using gradient descent:
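For every layer l:

$$
W^l \leftarrow W^l - \eta \, dW^l, \qquad b^l \leftarrow b^l - \eta \, db^l,
$$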
where η is the learning rate, which controls how much we adjust the parameters.
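To make these formulas concrete, here is a minimal NumPy sketch of one training step for a small two-layer network, assuming sigmoid activations and a squared error loss; the network sizes and variable names are illustrative, not taken from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Tiny network: 3 inputs -> 4 hidden neurons -> 1 output.
# Activations are row vectors of shape 1 x n_neurons.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

x = np.array([[0.5, -0.2, 0.1]])  # a^0: the inputs
y = np.array([[1.0]])             # target output
eta = 0.1                         # learning rate

# Forward propagation
a0 = x
z1 = a0 @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)

# Derivative of the squared error loss with respect to the output activations
da2 = 2 * (a2 - y)

# Backward propagation: output layer
dz2 = da2 * sigmoid_derivative(z2)
dW2 = a1.T @ dz2
db2 = dz2
da1 = dz2 @ W2.T

# Backward propagation: hidden layer
dz1 = da1 * sigmoid_derivative(z1)
dW1 = a0.T @ dz1
db1 = dz1

# Gradient descent update
W2 -= eta * dW2
b2 -= eta * db2
W1 -= eta * dW1
b1 -= eta * db1
```

Repeating this step over many examples (or batches) and many iterations is what gradually reduces the loss and trains the network.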