Transformers for Natural Language Processing

How to Stack Transformer Blocks


import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        # Multi-head self-attention sub-layer
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        # Position-wise feed-forward sub-layer
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention with residual connection and layer normalization
        attn_output, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_output)
        # Feed-forward with residual connection and layer normalization
        ff_output = self.ff(x)
        x = self.norm2(x + ff_output)
        return x

class StackedEncoder(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderBlock(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        # Pass the output of each block as the input to the next
        for layer in self.layers:
            x = layer(x)
        return x

# Example usage: stack 4 encoder blocks for input text embeddings
embed_dim = 64
num_heads = 4
ff_dim = 256
num_layers = 4

stacked_encoder = StackedEncoder(num_layers, embed_dim, num_heads, ff_dim)
input_tensor = torch.rand(2, 10, embed_dim)  # batch_size=2, seq_len=10
output = stacked_encoder(input_tensor)
print("Stacked encoder output shape:", output.shape)

When you process text with a Transformer, you often use several encoder or decoder blocks stacked on top of each other. In the code above, you see how to create a stack of encoder blocks using PyTorch. Each EncoderBlock contains multi-head self-attention, a feed-forward network, and layer normalization. The StackedEncoder class builds a sequence of these blocks, passing the output of one block as the input to the next. This stacking allows the model to learn increasingly complex representations of the text at each layer. For NLP tasks, such as text classification or translation, stacking blocks helps capture deeper relationships and context in the data. The initial input - text converted to embeddings - flows through each encoder block, with each block refining the representation based on both the input and the context learned in previous layers.
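PyTorch also ships building blocks that implement this same stacking pattern, namely nn.TransformerEncoderLayer and nn.TransformerEncoder. The sketch below is one possible rough equivalent of the custom StackedEncoder above, using the same dimensions; it is not an exact drop-in replacement, since the built-in layer also applies dropout and its own initialization.

import torch
import torch.nn as nn

# Rough built-in equivalent of the custom StackedEncoder above
encoder_layer = nn.TransformerEncoderLayer(
    d_model=64,           # embed_dim
    nhead=4,              # num_heads
    dim_feedforward=256,  # ff_dim
    batch_first=True
)
builtin_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

input_tensor = torch.rand(2, 10, 64)  # batch_size=2, seq_len=10
print("Built-in encoder output shape:", builtin_encoder(input_tensor).shape)  # torch.Size([2, 10, 64])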

Transformer architectures can be grouped into three main types, each suited to different NLP tasks:

Encoder-Only Models

  • Structure: Composed of a stack of encoder blocks.
  • Typical Use-Cases: These models are ideal for understanding or analyzing text. Common applications include text classification, sentiment analysis, and named entity recognition. A well-known example of this type is BERT. A minimal encoder-only sketch follows below.
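As a rough illustration of the encoder-only pattern, the sketch below reuses the StackedEncoder class (and the torch/nn imports) from the code above and adds a hypothetical classification head; the mean pooling and the three output classes are arbitrary choices for this example, and a real encoder-only model such as BERT also includes token, positional, and segment embeddings plus large-scale pretraining.

# Encoder-only sketch: the StackedEncoder from above plus a hypothetical classification head
class EncoderOnlyClassifier(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, num_classes):
        super().__init__()
        self.encoder = StackedEncoder(num_layers, embed_dim, num_heads, ff_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.encoder(x)        # (batch, seq_len, embed_dim)
        x = x.mean(dim=1)          # simple mean pooling over the sequence
        return self.classifier(x)  # (batch, num_classes)

classifier = EncoderOnlyClassifier(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256, num_classes=3)
logits = classifier(torch.rand(2, 10, 64))
print("Class logits shape:", logits.shape)  # torch.Size([2, 3])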

Decoder-Only Models

  • Structure: Composed of a stack of decoder blocks.
  • Typical Use-Cases: These models are designed for generating text, such as in language modeling and code completion tasks. GPT is a prominent example of a decoder-only model. A sketch of a causally masked decoder block follows below.
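A decoder-only block can be sketched under the assumption that the only structural change from EncoderBlock is a causal mask, so each position can attend only to itself and earlier positions; a real GPT-style model also adds embeddings, positional information, and an output projection over the vocabulary. The sketch reuses the torch/nn imports from the code above.

# Decoder-only sketch: like EncoderBlock, but with causally masked self-attention
class DecoderOnlyBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True entries are blocked, so position i only sees positions <= i
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_output, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_output)
        x = self.norm2(x + self.ff(x))
        return x

decoder_block = DecoderOnlyBlock(embed_dim=64, num_heads=4, ff_dim=256)
print("Decoder block output shape:", decoder_block(torch.rand(2, 10, 64)).shape)  # torch.Size([2, 10, 64])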

Encoder-Decoder Models

  • Structure: Combine encoder blocks that process the input and decoder blocks that generate the output, with information flowing from the encoder to the decoder.
  • Typical Use-Cases: These are used for sequence-to-sequence tasks where both the input and output are sequences. Applications include machine translation, summarization, and question answering. T5 and the original Transformer are examples of encoder-decoder architectures. A sketch of the encoder-to-decoder wiring follows below.
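To make the encoder-to-decoder information flow concrete, the sketch below wires a decoder block with cross-attention to the output of the StackedEncoder defined earlier; this is only an illustration of the wiring under those assumptions, not a full T5-style model, which would also handle embeddings, padding masks, and output generation.

# Encoder-decoder sketch: the decoder's cross-attention reads the encoder output ("memory")
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, x, memory):
        # Masked self-attention over the target sequence
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        self_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self_out)
        # Cross-attention: queries come from the decoder, keys/values from the encoder output
        cross_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + cross_out)
        x = self.norm3(x + self.ff(x))
        return x

encoder = StackedEncoder(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256)
decoder_block = DecoderBlock(embed_dim=64, num_heads=4, ff_dim=256)

source = torch.rand(2, 10, 64)  # source-sequence embeddings
target = torch.rand(2, 7, 64)   # target-sequence embeddings produced so far
memory = encoder(source)
print("Decoder output shape:", decoder_block(target, memory).shape)  # torch.Size([2, 7, 64])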

In summary, the choice of architecture determines the kinds of NLP problems you can solve: encoder-only models for analysis, decoder-only models for generation, and encoder-decoder models for tasks that require mapping one sequence to another.


Which Transformer architecture would you typically use for a machine translation task?



Section 2. Chapter 6
