Challenge: Implementing Feed-Forward Networks
As you explore the Transformer architecture for natural language processing, you encounter a crucial component inside each Transformer block: the position-wise feed-forward network (FFN). After the self-attention mechanism processes input representations, the FFN further transforms these representations at each position in the sequence, independently of other positions. This means that for every token in a sentence, the same small neural network is applied, allowing the model to introduce additional non-linearity and learn more complex patterns from the text. The FFN is essential for capturing relationships and refining the information encoded by self-attention, especially when dealing with the subtleties and ambiguities of human language.
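For reference, the original Transformer paper ("Attention Is All You Need") defines this computation as two linear transformations with a ReLU in between; in LaTeX notation:

```latex
\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2
```

The implementation below follows this formula directly.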
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

class PositionWiseFeedForward:
    def __init__(self, d_model, d_ff):
        # Initialize weights and biases for two linear layers
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros((1, d_ff))
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros((1, d_model))

    def __call__(self, x):
        # x shape: (batch_size, seq_len, d_model)
        # Apply first linear layer and ReLU activation
        out1 = relu(np.matmul(x, self.W1) + self.b1)
        # Apply second linear layer
        out2 = np.matmul(out1, self.W2) + self.b2
        return out2

# Example usage:
batch_size = 2
seq_len = 4
d_model = 8
d_ff = 16

# Example input: random tensor simulating text representations
x = np.random.randn(batch_size, seq_len, d_model)

ffn = PositionWiseFeedForward(d_model, d_ff)
output = ffn(x)
print("Output shape:", output.shape)
```
In the code above, you see a simple implementation of a position-wise feed-forward network using numpy. The network consists of two linear transformations (matrix multiplications), separated by a ReLU activation function.
ReLU activation function: The ReLU (Rectified Linear Unit) activation is defined as relu(x) = max(0, x). It sets all negative values to zero and keeps positive values unchanged. ReLU is used in the feed-forward network to introduce non-linearity, allowing the network to learn more complex patterns from the data.
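A quick sanity check of this behavior on a small sample array:

```python
import numpy as np

def relu(x):
    # Zero out negative values, keep positive values unchanged
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# Output: [0.  0.  0.  1.5 3. ]
```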
The first linear layer projects the input from d_model dimensions (the size of each token's embedding) to a higher-dimensional space d_ff, allowing the model to capture more complex features; in the original Transformer, d_ff is four times d_model (2048 for a d_model of 512). The second linear layer projects the result back to the original d_model size. Notice that this network is applied independently to each position in the sequence, so the transformation for one token does not directly affect others. This independence lets the model process every token's representation in parallel, making Transformers highly efficient for text data, as the sketch below demonstrates.
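To make this independence concrete, here is a minimal sketch (the weights and variable names are illustrative, not part of the exercise) showing that running the FFN on one token in isolation gives the same result as running it on the whole sequence and slicing out that token:

```python
import numpy as np

np.random.seed(0)
d_model, d_ff = 8, 16

# Illustrative random weights for the two linear layers
W1 = np.random.randn(d_model, d_ff) * 0.01
b1 = np.zeros((1, d_ff))
W2 = np.random.randn(d_ff, d_model) * 0.01
b2 = np.zeros((1, d_model))

def ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(1, 4, d_model)           # one sequence of 4 tokens
full = ffn(x)                                 # FFN over the whole sequence
single = ffn(x[:, 2:3, :])                    # FFN over token 2 alone
print(np.allclose(full[:, 2:3, :], single))   # True: positions never interact
```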
Implement a position-wise feed-forward network function using numpy.
Define a function position_wise_ffn(x, W1, b1, W2, b2) that takes:
- x: a numpy array of shape (batch_size, seq_len, d_model);
- W1: a numpy array of shape (d_model, d_ff);
- b1: a numpy array of shape (1, d_ff);
- W2: a numpy array of shape (d_ff, d_model);
- b2: a numpy array of shape (1, d_model).
For each position in the sequence, apply:
- A linear transformation: out1 = x @ W1 + b1;
- A ReLU activation: out1 = relu(out1);
- A second linear transformation: out2 = out1 @ W2 + b2.
Return the output array out2 with shape (batch_size, seq_len, d_model).
Solution
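One possible implementation that satisfies the specification above (a sketch; the official solution may differ in details such as the relu helper):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (batch_size, seq_len, d_model)
    # First linear transformation projects to d_ff, followed by ReLU
    out1 = relu(x @ W1 + b1)
    # Second linear transformation projects back to d_model
    out2 = out1 @ W2 + b2
    return out2  # shape: (batch_size, seq_len, d_model)
```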