Foundations of Attention | Attention Mechanisms Theory

Inductive Biases of Self-Attention

Understanding the inductive biases of neural architectures is essential for grasping how models learn from data and generalize to new tasks. An inductive bias is any set of assumptions a model uses to predict outputs for inputs it has not encountered before. Different neural architectures, such as self-attention, recurrence, and convolution, embed distinct inductive biases in their design. These biases shape how each model processes input sequences and which patterns it is predisposed to capture. By examining the inductive biases of self-attention and contrasting them with those of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), you can better appreciate why attention-based models have become so prominent in modern machine learning.

Self-attention introduces two crucial inductive biases: permutation equivariance and global context access. Unlike RNNs, which process sequences in a fixed order and inherently encode a notion of sequence direction, self-attention treats the input as a set rather than a strict sequence. This means that if you permute the input tokens, the output will permute in the same way, preserving relative relationships but not depending on absolute positions. This property is called permutation equivariance.

Additionally, self-attention enables every token in a sequence to attend to every other token in a single layer, granting the model global context access. In contrast, convolutions only allow each token to interact with a fixed-size local neighborhood, and RNNs must propagate information sequentially, making long-range dependencies harder to capture. These differences mean that self-attention is highly flexible, able to model complex dependencies without being restricted by local or sequential processing. However, this flexibility comes at the cost of removing some helpful inductive biases—such as locality in CNNs or sequential order in RNNs—which can make training less efficient for tasks where such biases are beneficial.
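
To make global context access concrete, here is a minimal single-head scaled dot-product attention sketch in NumPy (random matrices stand in for the learned query/key/value projections, and there are no positional encodings or masking; all names are illustrative). The attention weights form a full n x n matrix, so in a single layer every token's output already mixes information from every position.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 16                                   # 6 tokens, 16-dim embeddings
X = rng.normal(size=(n, d))

# Random projections stand in for the learned query/key/value weights.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

weights = softmax(Q @ K.T / np.sqrt(d))        # shape (n, n): one weight per token pair
output = weights @ V                           # each output row mixes information from ALL tokens

print(weights.shape)                           # (6, 6)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True: every token attends over every position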

Definition

Permutation equivariance means that if the input sequence is permuted, the model's output is permuted in the same way. This property allows self-attention to process sequences without assuming a fixed order, making it highly flexible for tasks where the relative position of elements matters more than their absolute order.
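
This property is easy to verify numerically. The sketch below assumes the simplest possible self-attention (queries, keys, and values are the inputs themselves, and no positional encodings are added): permuting the input rows permutes the output rows in exactly the same way. Once positional information is added to the input, this strict equivariance no longer holds.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Simplest form: queries, keys, and values are the inputs themselves,
    # with no positional information added.
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d))
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, 8-dim embeddings
perm = rng.permutation(5)          # a random reordering of the tokens

print(np.allclose(
    self_attention(X[perm]),       # permute first, then attend
    self_attention(X)[perm],       # attend first, then permute the output
))                                 # True: f(PX) = P f(X)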

Self-Attention
  • Context access: every token can directly attend to every other token (global access);
  • Inductive bias: permutation equivariant, minimal built-in assumptions about locality or order;
  • Trade-off: highly flexible, but may require more data to learn useful order or local patterns.
Convolution
  • Context access: each token interacts only with its local neighborhood (local access);
  • Inductive bias: strong locality bias, assumes nearby tokens are more related;
  • Trade-off: efficient for local patterns, but struggles with long-range dependencies (quantified in the sketch after this list).
Recurrence (RNN)
  • Context access: sequential, each token sees previous tokens through hidden state (stepwise access);
  • Inductive bias: strong sequential bias, encodes order and temporal dependencies;
  • Trade-off: good for ordered data, but slow and less effective for capturing global context.
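
The gap in context access can be quantified with a rough sketch (a hypothetical helper, assuming stride-1, undilated convolutions whose receptive field grows by kernel_size - 1 per layer): a convolutional stack needs on the order of seq_len / (kernel_size - 1) layers before tokens at opposite ends of the sequence can interact, an RNN needs as many sequential steps as there are tokens, while a single self-attention layer already connects every pair of tokens.

```python
import math

def conv_layers_for_full_context(seq_len, kernel_size):
    # With stride 1 and no dilation, d stacked conv layers cover a
    # receptive field of d * (kernel_size - 1) + 1 positions.
    return math.ceil((seq_len - 1) / (kernel_size - 1))

for n in (128, 1024, 8192):
    print(f"seq_len={n:>5}: conv(k=3) needs {conv_layers_for_full_context(n, 3):>4} layers, "
          f"RNN needs {n:>5} steps, self-attention needs 1 layer")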

1. What inductive bias does self-attention remove compared to RNNs?

2. How does permutation equivariance affect the way self-attention processes sequences?

3. What is the main trade-off between global context access and local inductive bias?
