Transformers for Natural Language Processing

What Is Multi-Head Attention


Multi-head attention is a powerful mechanism at the heart of the Transformer architecture. Its core idea is to allow the model to focus on different parts of a sentence simultaneously, capturing a wide range of relationships between words. To achieve this, the model projects each word's embedding into several sets of smaller query, key, and value vectors, one set per "head." Each head runs its own attention calculation in parallel. This means that while one head might learn to focus on the immediate neighbors of a word, another could pay attention to the start of the sentence, and yet another might track relationships across longer distances.
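
To make this concrete, here is a minimal NumPy sketch of multi-head self-attention. The weight matrices are random placeholders rather than trained parameters, and the function name is just illustrative; the point is the shape of the computation: project the input into queries, keys, and values, split them into heads, let every head attend in parallel, then concatenate and mix the results.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); each W: (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the input into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split each projection into num_heads smaller vectors, one per head.
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Every head runs its own scaled dot-product attention in parallel.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores)                               # attention per head
    heads = weights @ Vh                                    # (heads, seq, d_head)

    # Concatenate the heads and mix them with a final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

# Toy run: 6 tokens ("She enjoys reading books at night"), d_model=12, 3 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 12))                 # stand-in word embeddings
Wq, Wk, Wv, Wo = (rng.normal(size=(12, 12)) for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=3)
print(out.shape, weights.shape)              # (6, 12) (3, 6, 6)
```

Note the final projection: after the heads are concatenated, a single linear layer blends their separate views back into one vector per word, which is how the model combines the outputs from all heads.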

This parallel attention enables the Transformer to capture diverse patterns and dependencies in text data. For example, in a sentence like "The cat, which was hungry, chased the mouse," one head might focus on the main subject and verb ("cat" and "chased"), while another could focus on the descriptive clause ("which was hungry"). By combining the outputs from all heads, the model builds a much richer understanding of the entire sentence than any single attention mechanism could provide.

To visualize how multi-head attention works, imagine a grid where each row represents a word in the input sentence and each column represents an attention head. Each cell in this grid shows which words a particular head is attending to for a given word. For instance, if you have the sentence:

"She enjoys reading books at night"

Suppose you have three attention heads. One plausible version of this grid is shown below; the entries are purely illustrative, since the exact patterns depend on what each head learns during training:
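
Word       Head 1 (grammatical flow)   Head 2 (subject)   Head 3 (time/place)
She        → enjoys                    → She              → night
enjoys     → She                       → She              → night
reading    → enjoys                    → She              → night
books      → reading                   → She              → night
at         → night                     → She              → night
night      → at                        → She              → night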

In this grid, each head is learning to focus on different relationships. "Head 1" might track the grammatical flow, "Head 2" might focus on the subject, and "Head 3" might pay attention to location or time. This diversity of focus is what gives multi-head attention its strength in understanding complex language structures.
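
You can produce this kind of grid yourself. The snippet below reuses the `weights` array from the sketch above and prints, for each word, the word each head attends to most strongly. With the random, untrained weights the pattern is arbitrary; in a trained model the columns would show specializations like those just described.

```python
# Reusing `weights` from the sketch above: print each head's strongest target.
words = ["She", "enjoys", "reading", "books", "at", "night"]
print("Word".ljust(10) + "".join(f"Head {h + 1}".ljust(10) for h in range(3)))
for i, word in enumerate(words):
    targets = (words[weights[h, i].argmax()] for h in range(3))
    print(word.ljust(10) + "".join(t.ljust(10) for t in targets))
```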


What is the primary benefit of using multi-head attention in Transformer models?
