Transformers for Natural Language Processing

How Layer Normalization Stabilizes Transformers


Modern deep learning models rely on normalization techniques to help them train efficiently and accurately. In natural language processing (NLP), where input sequences can vary in length and structure, normalization is especially important for stabilizing the learning process. Two common normalization methods are batch normalization and layer normalization, but they serve different purposes and are suited to different types of data.

Batch Normalization

Batch normalization computes the mean and variance of each feature across the entire batch of inputs. This approach works well for computer vision tasks, where each image in a batch is typically of the same size and structure. However, in NLP tasks, especially with variable-length sequences, batch normalization can introduce unwanted dependencies between samples in a batch and may not handle varying sequence lengths gracefully.
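To make the contrast concrete, here is a minimal sketch of batch normalization in numpy, assuming the same 2D (batch, features) layout used later in this chapter. It is deliberately simplified: it omits the learnable scale and bias and the running statistics a real implementation tracks for inference. Note that the statistics are computed across the batch dimension (axis=0), so every row's output depends on the other rows.

import numpy as np

def batch_norm(x, epsilon=1e-5):
    """
    Simplified batch normalization for a 2D array x of shape
    (batch_size, num_features). Statistics are computed per
    feature, across the batch dimension (axis=0).
    """
    mean = np.mean(x, axis=0, keepdims=True)      # one mean per feature
    variance = np.var(x, axis=0, keepdims=True)   # one variance per feature
    return (x - mean) / np.sqrt(variance + epsilon)

# Each row's normalized values depend on every other row in the batch --
# exactly the cross-sample coupling that hurts variable-length NLP data
batch = np.array([[2.0, 4.0, 6.0],
                  [1.0, 3.0, 5.0],
                  [0.0, 0.0, 0.0]])
print(batch_norm(batch))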

Layer Normalization

Layer normalization, on the other hand, normalizes input features within each data point (such as each token embedding in a sentence) independently of other data points in the batch. This makes it much more suitable for NLP and sequence modeling. By standardizing the summed inputs to a neuron within a layer, layer normalization ensures that each token's representation is on a similar scale, regardless of its position or the batch's composition. This helps Transformers train stably and represent text more effectively, especially when dealing with long or complex sentences.
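In symbols, for a single token's feature vector $x$ of dimension $d$, with learnable scale $\gamma$ and bias $\beta$ and a small constant $\epsilon$ for numerical stability:

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

The code below implements exactly this, one row (token) at a time.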

import numpy as np

def layer_norm(x, epsilon=1e-5):
    """
    Applies layer normalization to a 2D numpy array x.
    Each row is normalized independently of the others.
    """
    mean = np.mean(x, axis=1, keepdims=True)
    variance = np.var(x, axis=1, keepdims=True)
    normalized = (x - mean) / np.sqrt(variance + epsilon)
    # Optional learnable scale and bias (gamma and beta), one per feature;
    # initialized here to the identity transform (ones and zeros)
    gamma = np.ones(x.shape[1])
    beta = np.zeros(x.shape[1])
    return gamma * normalized + beta

# Example: normalize a batch of 3 token embeddings (rows)
embeddings = np.array([[2.0, 4.0, 6.0],
                       [1.0, 3.0, 5.0],
                       [0.0, 0.0, 0.0]])
normalized_embeddings = layer_norm(embeddings)
print(normalized_embeddings)

Picture each row in your input as a unique token embedding, like a character in a story with its own quirks. The layer_norm function gives every token a fair chance by adjusting its values so they are centered around zero and share the same scale. It calculates the mean and variance for each row, then transforms the values so that no token stands out too much or fades into the background. Because each row is normalized independently, every token's features end up balanced regardless of sequence length or batch composition, which keeps training stable and efficient on the unpredictable lengths and structures found in real-world NLP data.
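In practice you rarely write this by hand; frameworks provide layer normalization as a building block. As a brief illustration (assuming PyTorch is installed), torch.nn.LayerNorm normalizes over the trailing dimension(s) and carries the learnable gamma (weight) and beta (bias) parameters, so at initialization it should reproduce the numpy result above:

import torch
import torch.nn as nn

# nn.LayerNorm normalizes over the trailing dimension(s) given by
# normalized_shape and learns a per-feature scale (weight) and bias
ln = nn.LayerNorm(normalized_shape=3, eps=1e-5)

embeddings = torch.tensor([[2.0, 4.0, 6.0],
                           [1.0, 3.0, 5.0],
                           [0.0, 0.0, 0.0]])

# At initialization weight=1 and bias=0, so this matches the
# hand-rolled numpy version above
print(ln(embeddings))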


Why is layer normalization preferred over batch normalization for NLP Transformers?

