Learn Backward Propagation | Neural Network from Scratch
Backward Propagation

Backward propagation (backprop) is the process of computing how the loss function changes with respect to each parameter in the network. The objective is to update the parameters in the direction that reduces the loss.

To achieve this, we use the gradient descent algorithm: we compute the derivatives of the loss with respect to each layer's pre-activations (the raw output values before the activation function is applied) and propagate them backward through the network.

Each layer contributes to the final prediction, so the gradients must be computed in a structured manner:

  1. Perform forward propagation;
  2. Compute the derivative of the loss with respect to the output pre-activation;
  3. Propagate this derivative backward through the layers using the chain rule;
  4. Compute gradients for weights and biases to update them.

Notation

To make the explanation clearer, let's use the following notation:

  • Wl is the weight matrix of layer l;
  • bl is the bias vector of layer l;
  • zl is the vector of pre-activations of layer l;
  • al is the vector of activations of layer l.

Therefore, setting a0 = x (the inputs), forward propagation in a perceptron with n layers can be described as follows:

zl = al-1 · Wl + bl
al = f(zl),  for l = 1, …, n

where f is the layer's activation function.
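The forward pass above can be sketched with NumPy. The sigmoid activation and the layer sizes here are arbitrary illustrative choices, not fixed by the course:

```python
import numpy as np

# Minimal sketch of forward propagation, assuming sigmoid activations
# and the row-vector convention used in this chapter (a0 has shape 1 x n_features).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative sizes: 3 inputs -> 4 hidden neurons -> 2 outputs.
layer_sizes = [3, 4, 2]
weights = [rng.normal(size=(m, k)) for m, k in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((1, k)) for k in layer_sizes[1:]]

a = rng.normal(size=(1, layer_sizes[0]))  # a0 = x
activations = [a]
for W, b in zip(weights, biases):
    z = a @ W + b    # pre-activation: zl = al-1 · Wl + bl
    a = sigmoid(z)   # activation:     al = f(zl)
    activations.append(a)

print(a.shape)  # final output has shape (1, 2)
```

Storing each layer's activations during the forward pass is what makes the backward pass possible later.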

To describe backpropagation mathematically, we introduce the following notations:

  • dal: derivative of the loss with respect to the activations at layer l;
  • dzl: derivative of the loss with respect to the pre-activations at layer l (before applying the activation function);
  • dWl: derivative of the loss with respect to the weights at layer l;
  • dbl: derivative of the loss with respect to the biases at layer l.

Computing Gradients for the Output Layer

At the final layer n, we first compute the gradient of the loss with respect to the output layer's activations, dan. Next, using the chain rule, we compute the gradient of the loss with respect to the output layer's pre-activations:

dzn = dan ⊙ f'(zn)

where ⊙ denotes element-wise multiplication and f' is the derivative of the activation function.

This quantity represents how sensitive the loss function is to changes in the output layer's pre-activation.

Once we have dzn, we compute the gradients for the weights and biases:

dWn = (an-1)T · dzn
dbn = dzn

where (an-1)T is the transposed vector of activations from the previous layer. Given that the original vector has shape 1 x n_neurons, the transposed vector has shape n_neurons x 1.

To propagate this backward, we calculate the derivative of the loss with respect to the activations of the previous layer:

dan-1 = dzn · (Wn)T
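The output-layer computations can be sketched as follows. The sigmoid activation, the mean squared error loss, and the shapes are illustrative assumptions; for a sigmoid, f'(z) conveniently equals a · (1 - a):

```python
import numpy as np

# Sketch of the output-layer gradients, assuming a sigmoid activation
# and MSE loss; all sizes are arbitrary illustrations.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
a_prev = rng.normal(size=(1, 4))   # a^(n-1), activations of the previous layer
W_out = rng.normal(size=(4, 2))    # Wn
b_out = np.zeros((1, 2))           # bn
y = np.array([[0.0, 1.0]])         # target

z_out = a_prev @ W_out + b_out     # zn
a_out = sigmoid(z_out)             # an

da = 2 * (a_out - y)               # dan: dL/dan for MSE loss
dz = da * a_out * (1 - a_out)      # dzn = dan * f'(zn); sigmoid'(z) = a(1 - a)
dW = a_prev.T @ dz                 # dWn = (a^(n-1))^T · dzn, shape (4, 2)
db = dz                            # dbn = dzn
da_prev = dz @ W_out.T             # da^(n-1) = dzn · (Wn)^T, shape (1, 4)
```

Note how the shapes work out: the (4, 1) transposed activations times the (1, 2) gradient give a (4, 2) matrix, matching W_out exactly.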

Propagating Gradients to the Hidden Layers

For each hidden layer l, the procedure is the same. Given dal:

  1. Compute the derivative of the loss with respect to the pre-activations;
  2. Compute the gradients for the weights and biases;
  3. Compute dal-1 to propagate the derivative backward.

This step repeats until we reach the first layer.
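The full backward pass, repeating these three steps from the last layer down to the first, can be sketched like this. Sigmoid activations, MSE loss, and the layer sizes are again illustrative assumptions:

```python
import numpy as np

# Sketch of the complete backward pass, assuming sigmoid activations
# for every layer and MSE loss.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
sizes = [3, 4, 2]
weights = [rng.normal(size=(m, k)) for m, k in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((1, k)) for k in sizes[1:]]
x = rng.normal(size=(1, 3))
y = np.array([[0.0, 1.0]])

# Forward pass, storing every layer's activations for the backward pass.
activations = [x]
a = x
for W, b in zip(weights, biases):
    a = sigmoid(a @ W + b)
    activations.append(a)

# Backward pass: repeat dzl -> dWl, dbl -> dal-1 from the last layer down.
grads_W, grads_b = [], []
da = 2 * (activations[-1] - y)                # dan from the MSE loss
for l in reversed(range(len(weights))):
    a_l = activations[l + 1]
    dz = da * a_l * (1 - a_l)                 # dzl = dal * f'(zl)
    grads_W.insert(0, activations[l].T @ dz)  # dWl = (a^(l-1))^T · dzl
    grads_b.insert(0, dz)                     # dbl = dzl
    da = dz @ weights[l].T                    # da^(l-1), propagated backward
```

Each gradient matrix has exactly the same shape as the parameter it corresponds to, which is what allows the update step that follows.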

Updating Weights and Biases

Once we have computed the gradients for all layers, we update the weights and biases using gradient descent:

Wl = Wl - η · dWl
bl = bl - η · dbl

where η is the learning rate, which controls how much we adjust the parameters.
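The update rule itself is a single line per parameter. In this sketch the gradients are random placeholders standing in for backprop output, and the learning rate of 0.1 is an arbitrary illustrative choice:

```python
import numpy as np

# Sketch of the gradient descent update Wl <- Wl - eta * dWl.
rng = np.random.default_rng(3)
W = rng.normal(size=(4, 2))
b = np.zeros((1, 2))
dW = rng.normal(size=(4, 2))  # stand-in for a computed gradient
db = rng.normal(size=(1, 2))
eta = 0.1                     # learning rate

W_new = W - eta * dW
b_new = b - eta * db
```

In a full training loop, this update is applied to every layer's weights and biases after each backward pass.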

Section 2. Chapter 7