Backward Propagation | Neural Network from Scratch

Course content: Introduction to Neural Networks

1. Concept of Neural Network
2. Neural Network from Scratch
3. Conclusion
Backward Propagation

Backward propagation (backprop) is the process of computing how the loss function changes with respect to each parameter in the network. The objective is to update the parameters in the direction that reduces the loss.

To achieve this, we use the gradient descent algorithm and compute the derivatives of the loss with respect to each layer's pre-activation values (raw output values before applying the activation function) and propagate them backward.

Each layer contributes to the final prediction, so the gradients must be computed in a structured manner:

  1. Perform forward propagation;

  2. Compute the derivative of the loss with respect to the output pre-activation;

  3. Propagate this derivative backward through the layers using the chain rule;

  4. Compute gradients for weights and biases to update them.

Note

Gradients represent the rate of change of a function with respect to its inputs, meaning they are its derivatives. They indicate how much a small change in weights, biases, or activations affects the loss function, guiding the model's learning process through gradient descent.

Notation

To make the explanation clearer, let's use the following notation:

  • W^l is the weight matrix of layer l;

  • b^l is the vector of biases of layer l;

  • z^l is the vector of pre-activations of layer l;

  • a^l is the vector of activations of layer l.

Therefore, setting a^0 to x (the inputs), forward propagation in a perceptron with n layers can be described as the following sequence of operations:

\begin{aligned} a^0 &= x, & &\dots & &\dots\\ z^1 &= W^1 a^0 + b^1, & z^l &= W^l a^{l-1} + b^l, & z^n &= W^n a^{n-1} + b^n,\\ a^1 &= f^1(z^1), & a^l &= f^l(z^l), & a^n &= f^n(z^n),\\ &\dots & &\dots & \hat y &= a^n. \end{aligned}
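As a concrete illustration, the forward pass above can be sketched in NumPy. The sigmoid activation and the 2-3-1 layer sizes are assumptions made for this example; the text leaves f^l and the architecture generic.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation (an assumed choice of f^l)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Run a^0 = x through z^l = W^l a^{l-1} + b^l, a^l = f^l(z^l).

    Returns all pre-activations and activations, since backprop
    will need both later.
    """
    a = x
    activations = [a]        # a^0, a^1, ..., a^n
    pre_activations = []     # z^1, ..., z^n
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        pre_activations.append(z)
        activations.append(a)
    return pre_activations, activations

# Tiny 2-3-1 network with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros((3, 1)), np.zeros((1, 1))]
x = np.array([[0.5], [-0.2]])

zs, activations = forward(x, weights, biases)
y_hat = activations[-1]   # hat y = a^n, shape (1, 1)
```

Storing every z^l and a^l during the forward pass is deliberate: the backward pass reuses all of them.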

To describe backpropagation mathematically, we introduce the following notations:

  • da^l: derivative of the loss with respect to the activations at layer l;

  • dz^l: derivative of the loss with respect to the pre-activations at layer l (before applying the activation function);

  • dW^l: derivative of the loss with respect to the weights at layer l;

  • db^l: derivative of the loss with respect to the biases at layer l.

Computing Gradients for the Output Layer

At the final layer n, we first compute the gradient of the loss with respect to the activations of the output layer, da^n. Next, using the chain rule, we compute the gradient of the loss with respect to the output layer's pre-activations:

dz^n = da^n \odot f'^n(z^n)

Note

The \odot symbol represents element-wise multiplication. Since we are working with vectors and matrices, the usual multiplication symbol \cdot represents the dot product instead. f'^n is the derivative of the activation function of the output layer.
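In NumPy terms, this is the difference between `*` (element-wise, \odot) and `@` (matrix product, \cdot). A quick check on two column vectors:

```python
import numpy as np

u = np.array([[1.0], [2.0]])   # column vector, shape (2, 1)
v = np.array([[3.0], [4.0]])

elementwise = u * v            # odot: shape (2, 1), multiplies entry by entry
dot = u.T @ v                  # cdot: shape (1, 1), sums the products
```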

This quantity represents how sensitive the loss function is to changes in the output layer's pre-activation.

Once we have dz^n, we compute gradients for the weights and biases:

\begin{aligned} dW^n &= dz^n \cdot (a^{n-1})^T\\ db^n &= dz^n \end{aligned}

where (a^{n-1})^T is the transposed vector of activations from the previous layer. Given that the original vector is an n_{neurons} \times 1 vector, the transposed vector is 1 \times n_{neurons}.

To propagate this backward, we calculate the derivative of the loss with respect to the activations of the previous layer:

da^{n-1} = (W^n)^T \cdot dz^n
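The output-layer formulas can be traced on toy numbers. Two assumptions in this sketch go beyond the text: the loss is taken to be L = 1/2 ||a^n − y||², so that da^n = a^n − y, and f^n is sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of sigmoid: f'(z) = f(z) * (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy quantities: 1 output neuron, 3 neurons in the previous layer.
z_n = np.array([[0.4]])                    # z^n
a_prev = np.array([[0.2], [0.7], [0.1]])   # a^{n-1}
W_n = np.array([[0.5, -0.3, 0.8]])         # W^n
y = np.array([[1.0]])                      # target

a_n = sigmoid(z_n)                  # a^n = f^n(z^n)
da_n = a_n - y                      # da^n, assuming L = 1/2 ||a^n - y||^2
dz_n = da_n * sigmoid_prime(z_n)    # dz^n = da^n (element-wise) f'^n(z^n)
dW_n = dz_n @ a_prev.T              # dW^n = dz^n . (a^{n-1})^T -> shape (1, 3)
db_n = dz_n                         # db^n = dz^n
da_prev = W_n.T @ dz_n              # da^{n-1} = (W^n)^T . dz^n -> shape (3, 1)
```

Note how the shapes match the text: dW^n has the same shape as W^n, and da^{n-1} has the same shape as a^{n-1}.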

Propagating Gradients to the Hidden Layers

For each hidden layer l, the procedure is the same. Given da^l:

  1. Compute the derivative of the loss with respect to the pre-activations;

  2. Compute the gradients for the weights and biases;

  3. Compute da^{l-1} to propagate the derivative backward.

\begin{aligned} dz^l &= da^l \odot f'^l(z^l)\\ dW^l &= dz^l \cdot (a^{l-1})^T\\ db^l &= dz^l\\ da^{l-1} &= (W^l)^T \cdot dz^l \end{aligned}

This step repeats until we reach the input layer.
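The whole layer-by-layer recursion can be sketched as one backward loop. As before, the MSE loss and sigmoid activations are assumed for concreteness; the structure of the loop follows the four equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(x, y, weights, biases):
    """Backprop sketch (assumes sigmoid activations, MSE loss)."""
    # Forward pass, storing every z^l and a^l for reuse.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)

    da = activations[-1] - y                  # da^n for L = 1/2 ||a^n - y||^2
    for l in range(len(weights) - 1, -1, -1):
        dz = da * sigmoid_prime(zs[l])        # dz^l = da^l (elem-wise) f'^l(z^l)
        grads_W[l] = dz @ activations[l].T    # dW^l = dz^l . (a^{l-1})^T
        grads_b[l] = dz                       # db^l = dz^l
        da = weights[l].T @ dz                # da^{l-1} = (W^l)^T . dz^l
    return grads_W, grads_b

# Same tiny 2-3-1 network as before.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros((3, 1)), np.zeros((1, 1))]
x, y = np.array([[0.5], [-0.2]]), np.array([[1.0]])

gW, gb = backward(x, y, weights, biases)
```

Each gradient has the same shape as the parameter it belongs to, which is exactly what the update step requires.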

Updating Weights and Biases

Once we have computed the gradients for all layers, we update the weights and biases using gradient descent:

\begin{aligned} W^l &= W^l - \alpha \cdot dW^l\\ b^l &= b^l - \alpha \cdot db^l \end{aligned}

where \alpha is the learning rate, which controls how much we adjust the parameters.
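The update rule is a single subtraction per parameter. A minimal check, with \alpha = 0.1 chosen arbitrarily for the example:

```python
import numpy as np

alpha = 0.1  # learning rate (assumed value for this example)

W = np.array([[0.5, -0.3]])    # some W^l
dW = np.array([[0.2, 0.4]])    # its gradient dW^l
b = np.array([[0.1]])          # some b^l
db = np.array([[-0.5]])        # its gradient db^l

W = W - alpha * dW   # W^l <- W^l - alpha * dW^l
b = b - alpha * db   # b^l <- b^l - alpha * db^l
```

A positive gradient component shrinks the corresponding weight and a negative one grows it, which is precisely "moving in the direction that reduces the loss".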

During backpropagation, how does a neural network update its weights and biases to minimize the loss function?

Section 2. Chapter 7
