Transformer Architecture Essentials | Introduction to Transformers and Transfer Learning

Transformers have revolutionized natural language processing by introducing a flexible architecture designed to handle sequences of data efficiently. At the heart of the Transformer is the encoder-decoder structure. The encoder processes the input data, such as a sentence, and converts it into a set of continuous representations. The decoder then takes these representations to generate the desired output, like a translated sentence or a summary. What sets Transformers apart from previous models is their use of self-attention and positional encoding.

Self-attention is a mechanism that allows the model to weigh the importance of different words in a sequence when encoding a particular word. This enables the model to capture relationships between words, regardless of their position in the sequence. However, since Transformers do not process data sequentially, they need a way to understand the order of the words. This is where positional encoding comes in. Positional encoding injects information about the position of each word into its representation, ensuring the model is aware of the sequence order.

Definition

Self-attention is a process where each element of a sequence (such as a word in a sentence) attends to all other elements, assigning different levels of importance to each. This helps the model understand context by considering relationships between all words at once.

Definition

Positional encoding provides a way for the model to incorporate the order of tokens in the sequence, since the Transformer itself does not inherently process data in order. By adding unique position-based vectors to each word embedding, the model can distinguish between different positions in the input.
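As a concrete illustration, here is a minimal sketch of the sinusoidal positional encoding scheme from the original Transformer design. The function name and the small dimensions are chosen for this example only; real models use much larger embedding sizes.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of unique position vectors."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions uses a different frequency
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return encoding

# Add position information to word embeddings of shape (seq_len, d_model)
embeddings = np.random.rand(3, 4)
encoded = embeddings + sinusoidal_positional_encoding(3, 4)
print(encoded.shape)  # (3, 4)
```

Because each position receives a distinct vector, two identical words at different positions end up with different representations after this addition.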

To better understand how self-attention works, consider the scaled dot-product attention mechanism, which is a core building block of the Transformer. This mechanism takes three inputs: queries, keys, and values. Each word in the sequence is projected into these three representations. The attention scores are computed by taking the dot product of the query with all keys, scaling by the square root of the key dimension, and applying a softmax to obtain weights. These weights are then used to produce a weighted sum of the values, resulting in the output for each word.

The self-attention operation for a set of queries $Q$, keys $K$, and values $V$ can be written as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here, $d_k$ is the dimensionality of the key vectors. This formula shows how attention weights are calculated and used to combine the value vectors for each position.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K, V: numpy arrays of shape (seq_len, d_k)
    Returns: attention output of shape (seq_len, d_k)
    """
    d_k = Q.shape[-1]
    # Similarity of each query with every key, scaled by sqrt(d_k)
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    # Row-wise softmax (subtract the row max for numerical stability)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    # Weighted sum of the value vectors
    output = np.dot(weights, V)
    return output

# Example input: 3 words, embedding size 4
Q = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]], dtype=float)
K = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 1, 1]], dtype=float)
V = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]], dtype=float)

attention_output = scaled_dot_product_attention(Q, K, V)
print(attention_output)
```

The attention output shows how each word's representation is updated by considering all other words in the sequence, weighted by their computed importance. This mechanism is highly efficient and forms the basis of the Transformer's ability to model complex relationships in data.

Note

A major advantage of attention is that it enables parallelization during training and inference. Unlike recurrent neural networks, which process sequences one step at a time, Transformers can process all positions simultaneously. This not only speeds up computation but also allows the model to capture context from the entire sequence at once, improving its ability to model long-range dependencies.
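The contrast can be sketched in a few lines. The recurrent step below is a stand-in for a real RNN cell, not an actual implementation; the point is only that its loop iterations depend on each other, while the attention computation is one batched matrix product over all positions.

```python
import numpy as np

seq_len, d = 5, 4
X = np.random.rand(seq_len, d)   # toy word representations
W = np.random.rand(d, d)

# RNN-style: each step depends on the previous hidden state,
# so positions must be processed one after another.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(X[t] @ W + h)    # step t cannot start before step t-1

# Attention-style: scores for *all* position pairs come from a
# single matrix multiplication, so every position is handled at once.
scores = X @ X.T / np.sqrt(d)                          # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)          # row-wise softmax
output = weights @ X                                   # all positions in parallel
print(output.shape)  # (5, 4)
```

On parallel hardware such as GPUs, the single matrix product in the attention path is far faster than the sequential loop, which is the practical payoff of this design.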

Understanding these architectural essentials provides the foundation for exploring more advanced concepts in Transformers. To reinforce your understanding, consider the following question about the role of attention in the Transformer model.

Which component allows Transformers to capture long-range dependencies in text?
