Introduction to Neural Networks
Backpropagation Implementation
General Approach
In forward propagation, each layer $l$ takes the outputs of the previous layer, $a^{l-1}$, as inputs and computes its own outputs. Therefore, the forward() method of the Layer class takes the vector of previous outputs as its only parameter, while the rest of the needed information is stored within the class.

In backward propagation, each layer $l$ only needs $da^l$ to compute the respective gradients and return $da^{l-1}$, so the backward() method takes the $da^l$ vector as its parameter. The rest of the required information is already stored in the Layer class.
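To make this interface concrete, here is a minimal skeleton of the Layer class implied by the description above (the names are taken from the text; the method bodies are filled in later on this page):

```python
class Layer:
    def forward(self, inputs):
        # inputs holds a^{l-1}; compute and cache z^l, then return a^l
        ...

    def backward(self, da):
        # da holds da^l; compute the gradients, update the parameters,
        # and return da^{l-1} for the previous layer
        ...
```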
Activation Functions Derivatives
Since derivatives of activation functions are needed for backpropagation, activation functions like ReLU and sigmoid should be structured as classes instead of standalone functions. This allows us to define both:

- The activation function itself (implemented via the __call__() method), allowing it to be applied to an input as activation(z);
- Its derivative (implemented via the derivative() method), which is used for backpropagation as activation.derivative(z).

By structuring activation functions as objects, we can easily pass them to the Layer class and use them dynamically.
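A minimal sketch of what these classes could look like with NumPy (the exact implementations used in the course may differ):

```python
import numpy as np

class ReLU:
    def __call__(self, z):
        # Elementwise max(0, z)
        return np.maximum(0, z)

    def derivative(self, z):
        # 1 where z > 0, 0 elsewhere (elementwise)
        return (z > 0).astype(float)

class Sigmoid:
    def __call__(self, z):
        # Elementwise sigmoid
        return 1 / (1 + np.exp(-z))

    def derivative(self, z):
        # sigma(z) * (1 - sigma(z)), elementwise
        s = self(z)
        return s * (1 - s)
```

An instance such as ReLU() can then be passed to the Layer class, which can call both activation(z) and activation.derivative(z) without knowing which activation it holds.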
ReLU
The derivative of the ReLU activation function is as follows, where $z_i$ is an element of the pre-activation vector $z$:
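$$
f'(z_i) = \begin{cases} 1, & z_i > 0 \\ 0, & z_i \le 0 \end{cases}
$$

(The derivative at $z_i = 0$ is taken to be 0 here, following the usual convention.)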
Sigmoid
The derivative of the sigmoid activation function is as follows:
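$$
\sigma'(z_i) = \sigma(z_i)\,\bigl(1 - \sigma(z_i)\bigr)
$$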
For both of these activation functions, we apply them to the entire vector $z$, and the same goes for their derivatives: NumPy internally applies the operation to each element of the vector. For example, if the vector $z$ contains 3 elements, the derivative is computed as follows:
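$$
f'(z) = f'\!\left(\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix}\right) = \begin{bmatrix} f'(z_1) \\ f'(z_2) \\ f'(z_3) \end{bmatrix}
$$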
The backward() Method
The backward() method is responsible for computing the gradients using the formulas below:
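$$
dz^l = da^l \odot f'\!\left(z^l\right)
$$

$$
dW^l = dz^l \left(a^{l-1}\right)^T, \qquad db^l = dz^l
$$

$$
da^{l-1} = \left(W^l\right)^T dz^l
$$

Here $\odot$ denotes elementwise multiplication; these are the standard backpropagation formulas for a single training example.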
$a^{l-1}$ and $z^l$ are stored as the inputs and outputs attributes in the Layer class, respectively. The activation function $f$ is stored as the activation attribute.
Once all the required gradients are computed, the weights and biases can be updated since they are no longer needed for further computation:
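$$
W^l \leftarrow W^l - \eta\, dW^l, \qquad b^l \leftarrow b^l - \eta\, db^l
$$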
Therefore, learning_rate ($\eta$) is another parameter of this method.
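Putting everything together, here is a minimal sketch of how such a backward() method could be implemented. The attribute names (inputs, outputs, weights, biases, activation) follow the description above, while the weight shapes and the single-example, column-vector convention are assumptions; the actual course implementation may differ:

```python
import numpy as np

class Layer:
    def __init__(self, n_inputs, n_neurons, activation):
        # Weight matrix W^l and bias vector b^l (column-vector convention assumed)
        self.weights = 0.01 * np.random.randn(n_neurons, n_inputs)
        self.biases = np.zeros((n_neurons, 1))
        self.activation = activation

    def forward(self, inputs):
        # Cache a^{l-1} (inputs) and z^l (outputs) for the backward pass
        self.inputs = inputs
        self.outputs = self.weights @ inputs + self.biases
        return self.activation(self.outputs)

    def backward(self, da, learning_rate):
        # dz^l = da^l ⊙ f'(z^l)
        dz = da * self.activation.derivative(self.outputs)
        # dW^l = dz^l (a^{l-1})^T,  db^l = dz^l
        d_weights = dz @ self.inputs.T
        d_biases = dz
        # da^{l-1} = (W^l)^T dz^l, returned to the previous layer
        da_prev = self.weights.T @ dz
        # Gradient descent update; the gradients are no longer needed afterwards
        self.weights -= learning_rate * d_weights
        self.biases -= learning_rate * d_biases
        return da_prev
```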