Transformers for Natural Language Processing

How to Stack Transformer Blocks


import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        # Multi-head self-attention sub-layer
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        # Position-wise feed-forward sub-layer
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention with residual connection and layer normalization
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward with residual connection and layer normalization
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)
        return x

class StackedEncoder(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        # Pass the output of each block as the input to the next
        for layer in self.layers:
            x = layer(x)
        return x

# Example usage: stack 4 encoder blocks for input text embeddings
embed_dim = 64
num_heads = 4
ff_dim = 256
num_layers = 4

stacked_encoder = StackedEncoder(num_layers, embed_dim, num_heads, ff_dim)
input_tensor = torch.rand(2, 10, embed_dim)  # batch_size=2, seq_len=10
output = stacked_encoder(input_tensor)
print("Stacked encoder output shape:", output.shape)

When you process text with a Transformer, you often use several encoder or decoder blocks stacked on top of each other. In the code above, you see how to create a stack of encoder blocks using PyTorch. Each EncoderBlock contains multi-head self-attention, a feed-forward network, and layer normalization. The StackedEncoder class builds a sequence of these blocks, passing the output of one block as the input to the next. This stacking allows the model to learn increasingly complex representations of the text at each layer. For NLP tasks, such as text classification or translation, stacking blocks helps capture deeper relationships and context in the data. The initial input - text converted to embeddings - flows through each encoder block, with each block refining the representation based on both the input and the context learned in previous layers.
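PyTorch also ships building blocks that implement this same stacking pattern, namely nn.TransformerEncoderLayer and nn.TransformerEncoder. The sketch below is one possible rough equivalent of the custom StackedEncoder above, using the same dimensions; it is not an exact drop-in replacement, since the built-in layer also applies dropout and its own initialization.

import torch
import torch.nn as nn

# Rough built-in equivalent of the custom StackedEncoder above
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,           # embed_dim
    nhead=4,              # num_heads
    dim_feedforward=256,  # ff_dim
    batch_first=True
)
builtin_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

input_tensor = torch.rand(2, 10, 64)  # batch_size=2, seq_len=10
print("Built-in encoder output shape:", builtin_encoder(input_tensor).shape)  # torch.Size([2, 10, 64])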

Transformer architectures can be grouped into three main types, each suited to different NLP tasks:

Encoder-Only Models

  • Structure: Composed of a stack of encoder blocks.
  • Typical Use-Cases: These models are ideal for understanding or analyzing text. Common applications include text classification, sentiment analysis, and named entity recognition. A well-known example of this type is BERT. A minimal encoder-only sketch follows below.
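As a rough illustration of the encoder-only pattern, the sketch below reuses the StackedEncoder class (and the torch/nn imports) from the code above and adds a hypothetical classification head; the mean pooling and the three output classes are arbitrary choices for this example, and a real encoder-only model such as BERT also includes token, positional, and segment embeddings plus large-scale pretraining.

# Encoder-only sketch: the StackedEncoder from above plus a hypothetical classification head
class EncoderOnlyClassifier(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, num_classes):
        super().__init__()
        self.encoder = StackedEncoder(num_layers, embed_dim, num_heads, ff_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.encoder(x)        # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)          # simple mean pooling over the sequence
        return self.classifier(x)  # (batch, num_classes)

classifier = EncoderOnlyClassifier(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256, num_classes=3)
logits = classifier(torch.rand(2, 10, 64))
print("Class logits shape:", logits.shape)  # torch.Size([2, 3])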

Decoder-Only Models

  • Structure: Composed of a stack of decoder blocks.
  • Typical Use-Cases: These models are designed for generating text, such as in language modeling and code completion tasks. GPT is a prominent example of a decoder-only model. A sketch of a causally masked decoder block follows below.
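A decoder-only block can be sketched under the assumption that the only structural change from EncoderBlock is a causal mask, so each position can attend only to itself and earlier positions; a real GPT-style model also adds embeddings, positional information, and an output projection over the vocabulary. The sketch reuses the torch/nn imports from the code above.

# Decoder-only sketch: like EncoderBlock, but with causally masked self-attention
class DecoderOnlyBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True entries are blocked, so position i only sees positions <= i
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_output, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_output)
        x = self.norm2(x + self.ff(x))
        return x

decoder_block = DecoderOnlyBlock(embed_dim=64, num_heads=4, ff_dim=256)
print("Decoder block output shape:", decoder_block(torch.rand(2, 10, 64)).shape)  # torch.Size([2, 10, 64])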

Encoder-Decoder Models

  • Structure: Combine encoder blocks that process the input and decoder blocks that generate the output, with information flowing from the encoder to the decoder.
  • Typical Use-Cases: These are used for sequence-to-sequence tasks where both the input and output are sequences. Applications include machine translation, summarization, and question answering. T5 and the original Transformer are examples of encoder-decoder architectures. A sketch of the encoder-to-decoder wiring follows below.
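To make the encoder-to-decoder information flow concrete, the sketch below wires a decoder block with cross-attention to the output of the StackedEncoder defined earlier; this is only an illustration of the wiring under those assumptions, not a full T5-style model, which would also handle embeddings, padding masks, and output generation.

# Encoder-decoder sketch: the decoder's cross-attention reads the encoder output ("memory")
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, x, memory):
        # Masked self-attention over the target sequence
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        self_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self_out)
        # Cross-attention: queries come from the decoder, keys/values from the encoder output
        cross_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + cross_out)
        x = self.norm3(x + self.ff(x))
        return x

encoder = StackedEncoder(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256)
decoder_block = DecoderBlock(embed_dim=64, num_heads=4, ff_dim=256)

source = torch.rand(2, 10, 64)  # source-sequence embeddings
target = torch.rand(2, 7, 64)   # target-sequence embeddings produced so far
memory = encoder(source)
print("Decoder output shape:", decoder_block(target, memory).shape)  # torch.Size([2, 7, 64])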

In summary, the choice of architecture determines the kinds of NLP problems you can solve: encoder-only models for analysis, decoder-only models for generation, and encoder-decoder models for tasks that require mapping one sequence to another.


Which Transformer architecture would you typically use for a machine translation task?



Section 2. Chapter 6
