Transformers for Natural Language Processing

How Self-Attention Works


To grasp how self-attention operates, imagine reading the sentence: "The animal didn't cross the street because it was too tired." When you encounter the word "it," you need to work out which noun "it" refers to. Self-attention allows a model to look at all the words in the sentence and decide which ones are most relevant to each word's meaning. This is achieved using queries, keys, and values: mathematical representations of each word that let the model compute how much attention to pay to every other word.
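To make the query/key/value idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The projection matrices `W_q`, `W_k`, `W_v` and the toy embeddings below are random illustrative stand-ins for parameters a trained model would learn:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word embeddings.

    X: (seq_len, d_model) array, one row per word.
    """
    Q = X @ W_q  # queries: what each word is looking for
    K = X @ W_k  # keys: what each word offers to the others
    V = X @ W_v  # values: the content that gets mixed together
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) relevance scores
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights  # contextualized vectors + attention map

# Toy example: 4 "words" with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
outputs, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))  # each row: how much that word attends to every word
```

Each row of `weights` is the attention distribution for one word; the corresponding row of `outputs` is that word's new representation, a weighted blend of every word's value vector.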

A helpful way to visualize self-attention is with a heatmap: a grid in which each row and column corresponds to a word in the sentence, and the cell color shows how much one word "attends" to another. In the heatmap below, darker cells indicate stronger attention between specific words. This visualization reveals which words the model connects most strongly as it processes the sentence:
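Such a heatmap can be rendered directly from an attention-weight matrix, for example with matplotlib. The weights here are random row-normalized placeholders, since the learned weights behind the figure above are not reproduced on this page; in practice you would plot the `weights` matrix returned by a model:

```python
import matplotlib.pyplot as plt
import numpy as np

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]

# Placeholder attention matrix: random rows normalized to sum to 1,
# mimicking the shape of a real (len(tokens), len(tokens)) attention map.
rng = np.random.default_rng(1)
weights = rng.random((len(tokens), len(tokens)))
weights /= weights.sum(axis=-1, keepdims=True)

fig, ax = plt.subplots()
ax.imshow(weights, cmap="Blues")  # darker cell = stronger attention
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("attended-to word")
ax.set_ylabel("attending word")
plt.tight_layout()
plt.show()
```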

Notice how the word "it" has strong attention weights toward "animal" and "tired," showing that the model has learned that "it" refers to "animal" and is linked to being "tired." These attention distributions are learned during training and enable the model to capture context and relationships regardless of how far apart the words are in the sentence. This mechanism is what gives Transformers their power to understand meaning in complex language.

