Transformers for Natural Language Processing

How Layer Normalization Stabilizes Transformers


Modern deep learning models rely on normalization techniques to help them train efficiently and accurately. In natural language processing (NLP), where input sequences can vary in length and structure, normalization is especially important for stabilizing the learning process. Two common normalization methods are batch normalization and layer normalization, but they serve different purposes and are suited to different types of data.

Batch Normalization

Batch normalization computes the mean and variance of each feature across the entire batch of inputs. This approach works well for computer vision tasks, where each image in a batch is typically of the same size and structure. However, in NLP tasks, especially with variable-length sequences, batch normalization can introduce unwanted dependencies between samples in a batch and may not handle varying sequence lengths gracefully.
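To make the contrast concrete, here is a minimal sketch of batch normalization in numpy, assuming the same 2D (batch, features) layout used later in this chapter. It is deliberately simplified: it omits the learnable scale and bias and the running statistics a real implementation tracks for inference. Note that the statistics are computed across the batch dimension (axis=0), so every row's output depends on the other rows.

import numpy as np

def batch_norm(x, epsilon=1e-5):
    """
    Simplified batch normalization for a 2D array x of shape
    (batch_size, num_features). Statistics are computed per
    feature, across the batch dimension (axis=0).
    """
    mean = np.mean(x, axis=0, keepdims=True)      # one mean per feature
    variance = np.var(x, axis=0, keepdims=True)   # one variance per feature
    return (x - mean) / np.sqrt(variance + epsilon)

# Each row's normalized values depend on every other row in the batch --
# exactly the cross-sample coupling that hurts variable-length NLP data
batch = np.array([[2.0, 4.0, 6.0],
                  [1.0, 3.0, 5.0],
                  [0.0, 0.0, 0.0]])
print(batch_norm(batch))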

Layer Normalization

Layer normalization, on the other hand, normalizes input features within each data point (such as each token embedding in a sentence) independently of other data points in the batch. This makes it much more suitable for NLP and sequence modeling. By standardizing the summed inputs to a neuron within a layer, layer normalization ensures that each token's representation is on a similar scale, regardless of its position or the batch's composition. This helps Transformers train stably and represent text more effectively, especially when dealing with long or complex sentences.
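In symbols, for a single token's feature vector $x$ of dimension $d$, with learnable scale $\gamma$ and bias $\beta$ and a small constant $\epsilon$ for numerical stability:

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \,\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

The code below implements exactly this, one row (token) at a time.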

import numpy as np

def layer_norm(x, epsilon=1e-5):
    """
    Applies layer normalization to a 2D numpy array x.
    Each row is normalized independently of the others.
    """
    mean = np.mean(x, axis=1, keepdims=True)
    variance = np.var(x, axis=1, keepdims=True)
    normalized = (x - mean) / np.sqrt(variance + epsilon)
    # Optional learnable scale and bias (gamma and beta), one per feature;
    # initialized here to the identity transform (ones and zeros)
    gamma = np.ones(x.shape[1])
    beta = np.zeros(x.shape[1])
    return gamma * normalized + beta

# Example: normalize a batch of 3 token embeddings (rows)
embeddings = np.array([[2.0, 4.0, 6.0],
                       [1.0, 3.0, 5.0],
                       [0.0, 0.0, 0.0]])
normalized_embeddings = layer_norm(embeddings)
print(normalized_embeddings)

Picture each row in your input as a unique token embedding, like a character in a story with its own quirks. The layer_norm function gives every token a fair chance by adjusting its values so they are centered around zero and share the same scale. It calculates the mean and variance for each row, then transforms the values so that no token stands out too much or fades into the background. Because each row is normalized independently, every token's features end up balanced regardless of sequence length or batch composition, which keeps training stable and efficient on the unpredictable lengths and structures found in real-world NLP data.
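In practice you rarely write this by hand; frameworks provide layer normalization as a building block. As a brief illustration (assuming PyTorch is installed), torch.nn.LayerNorm normalizes over the trailing dimension(s) and carries the learnable gamma (weight) and beta (bias) parameters, so at initialization it should reproduce the numpy result above:

import torch
import torch.nn as nn

# nn.LayerNorm normalizes over the trailing dimension(s) given by
# normalized_shape and learns a per-feature scale (weight) and bias
ln = nn.LayerNorm(normalized_shape=3, eps=1e-5)

embeddings = torch.tensor([[2.0, 4.0, 6.0],
                           [1.0, 3.0, 5.0],
                           [0.0, 0.0, 0.0]])

# At initialization weight=1 and bias=0, so this matches the
# hand-rolled numpy version above
print(ln(embeddings))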


Why is layer normalization preferred over batch normalization for NLP Transformers?

