Challenge: Implementing Feed-Forward Networks
As you explore the Transformer architecture for natural language processing, you encounter a crucial component inside each Transformer block: the position-wise feed-forward network (FFN). After the self-attention mechanism processes input representations, the FFN further transforms these representations at each position in the sequence, independently of other positions. This means that for every token in a sentence, the same small neural network is applied, allowing the model to introduce additional non-linearity and learn more complex patterns from the text. The FFN is essential for capturing relationships and refining the information encoded by self-attention, especially when dealing with the subtleties and ambiguities of human language.
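For reference, the original Transformer paper ("Attention Is All You Need") defines this computation as two linear transformations with a ReLU in between; in LaTeX notation:

```latex
\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2
```

The implementation below follows this formula directly.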
```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

class PositionWiseFeedForward:
    def __init__(self, d_model, d_ff):
        # Initialize weights and biases for two linear layers
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros((1, d_ff))
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros((1, d_model))

    def __call__(self, x):
        # x shape: (batch_size, seq_len, d_model)
        # Apply first linear layer and ReLU activation
        out1 = relu(np.matmul(x, self.W1) + self.b1)
        # Apply second linear layer
        out2 = np.matmul(out1, self.W2) + self.b2
        return out2

# Example usage:
batch_size = 2
seq_len = 4
d_model = 8
d_ff = 16

# Example input: random tensor simulating text representations
x = np.random.randn(batch_size, seq_len, d_model)

ffn = PositionWiseFeedForward(d_model, d_ff)
output = ffn(x)
print("Output shape:", output.shape)
```
In the code above, you see a simple implementation of a position-wise feed-forward network using numpy. The network consists of two linear transformations (matrix multiplications), separated by a ReLU activation function.
ReLU activation function: The ReLU (Rectified Linear Unit) activation is defined as relu(x) = max(0, x). It sets all negative values to zero and keeps positive values unchanged. ReLU is used in the feed-forward network to introduce non-linearity, allowing the network to learn more complex patterns from the data.
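A quick sanity check of this behavior on a small sample array:

```python
import numpy as np

def relu(x):
    # Zero out negative values, keep positive values unchanged
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# Output: [0.  0.  0.  1.5 3. ]
```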
The first linear layer projects the input from d_model dimensions (the size of each token's embedding) to a higher-dimensional space d_ff, allowing the model to capture more complex features; in the original Transformer, d_ff is four times d_model (2048 for a d_model of 512). The second linear layer projects the result back to the original d_model size. Notice that this network is applied independently to each position in the sequence, so the transformation for one token does not directly affect others. This independence lets the model process every token's representation in parallel, making Transformers highly efficient for text data, as the sketch below demonstrates.
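To make this independence concrete, here is a minimal sketch (the weights and variable names are illustrative, not part of the exercise) showing that running the FFN on one token in isolation gives the same result as running it on the whole sequence and slicing out that token:

```python
import numpy as np

np.random.seed(0)
d_model, d_ff = 8, 16

# Illustrative random weights for the two linear layers
W1 = np.random.randn(d_model, d_ff) * 0.01
b1 = np.zeros((1, d_ff))
W2 = np.random.randn(d_ff, d_model) * 0.01
b2 = np.zeros((1, d_model))

def ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(1, 4, d_model)           # one sequence of 4 tokens
full = ffn(x)                                 # FFN over the whole sequence
single = ffn(x[:, 2:3, :])                    # FFN over token 2 alone
print(np.allclose(full[:, 2:3, :], single))   # True: positions never interact
```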
Implement a position-wise feed-forward network function using numpy.
Define a function position_wise_ffn(x, W1, b1, W2, b2) that takes:
- x: a numpy array of shape (batch_size, seq_len, d_model);
- W1: a numpy array of shape (d_model, d_ff);
- b1: a numpy array of shape (1, d_ff);
- W2: a numpy array of shape (d_ff, d_model);
- b2: a numpy array of shape (1, d_model).
For each position in the sequence, apply:
- A linear transformation: out1 = x @ W1 + b1;
- A ReLU activation: out1 = relu(out1);
- A second linear transformation: out2 = out1 @ W2 + b2.
Return the output array out2 with shape (batch_size, seq_len, d_model).
Solution
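One possible implementation that satisfies the specification above (a sketch; the official solution may differ in details such as the relu helper):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (batch_size, seq_len, d_model)
    # First linear transformation projects to d_ff, followed by ReLU
    out1 = relu(x @ W1 + b1)
    # Second linear transformation projects back to d_model
    out2 = out1 @ W2 + b2
    return out2  # shape: (batch_size, seq_len, d_model)
```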