Transformer-Based Generative Models

Introduction to Transformers and Self-Attention

Transformers are a foundational architecture in modern AI, especially in Natural Language Processing (NLP) and generative modeling. First introduced in the paper "Attention is All You Need" (Vaswani et al., 2017), transformers discard recurrence in favor of a mechanism called self-attention, which allows models to consider all parts of the input sequence at once.

Self-Attention Mechanism

The self-attention mechanism enables the model to weigh the importance of different tokens in a sequence relative to each other. This is done using three matrices derived from the input embeddings:

  • Query (Q);
  • Key (K);
  • Value (V).

The attention output is computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q, K, and V are matrices derived from the input;
  • d_k is the dimension of the key vectors;
  • softmax converts the similarity scores to probabilities.

This allows each token to attend to every other token and adjust its representation accordingly.
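
Below is a minimal NumPy sketch of the attention formula above. For simplicity it uses the raw embeddings directly as Q, K, and V (in a real transformer they come from learned linear projections), and the toy shapes are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices derived from the input embeddings
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, embedding dimension 8
print(scaled_dot_product_attention(X, X, X).shape)   # (4, 8)
```

Each row of the weight matrix tells how strongly one token attends to every other token in the sequence.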

Transformer Architecture Overview

The transformer model consists of stacked encoder and decoder layers:

  • Encoder converts input into a contextualized latent representation;
  • Decoder generates output tokens using the encoder’s output and prior tokens.

Each layer includes:

  • Multi-Head Self-Attention;
  • Feedforward Neural Networks;
  • Layer Normalization;
  • Residual Connections.

Multi-Head Self-Attention

Instead of computing a single attention function, the transformer uses multiple attention heads. Each head learns to focus on different parts of the sequence.

\text{Multi-Head}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_n)W^O

Where each head is computed as:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Where:

  • W_i^Q, W_i^K, W_i^V are projection matrices for queries, keys, and values;
  • W^O projects the concatenated heads back to the original dimension.
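
A rough sketch of multi-head attention, reusing the scaled_dot_product_attention function from the previous snippet. Splitting the projected matrices column-wise into heads is one simple way to realize the per-head projections; the weight names and toy dimensions are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (seq_len, d_model); all weight matrices: (d_model, d_model)
    d_model = X.shape[-1]
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)          # columns belonging to head i
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the heads and project back to d_model with W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(4, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)   # (4, 8)
```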

Feedforward Neural Networks

Each transformer block includes a position-wise feedforward network applied independently to each position:

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1)W_2 + b_2

  • It consists of two linear layers with a non-linearity (e.g., ReLU) in between;
  • Applies the same transformation across all positions.
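
A minimal sketch of the position-wise feedforward network defined above; the inner dimension d_ff and the random toy weights are illustrative choices, not values from the text.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # The same two linear layers are applied independently at every position
    hidden = np.maximum(0.0, x @ W1 + b1)    # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                        # the inner dimension d_ff is usually larger
x = rng.normal(size=(4, d_model))            # 4 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
```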

Layer Normalization

Layer normalization normalizes the input across the feature dimension (rather than across the batch, as batch normalization does). It stabilizes training and improves convergence:

\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta

Where:

  • μ is the mean of the features;
  • σ is the standard deviation;
  • γ and β are learnable parameters.
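
A small NumPy sketch of layer normalization as defined above; the epsilon term is a common numerical-stability addition that is not shown in the formula.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position over its feature dimension (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)        # learnable scale and shift, initialized to identity
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))   # ~0 mean and ~1 std per position
```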

Residual Connections

Residual connections add the input of each sub-layer to its output:

\text{Output} = \text{Layer}(x) + x

  • This helps with gradient flow and enables training of deeper models;
  • Used around both the self-attention and feedforward layers.
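
A brief sketch of the "Add & Norm" pattern, reusing layer_norm, feed_forward, and their toy weights from the sketches above. Normalizing after the residual addition is one common arrangement (post-norm); some models normalize before the sub-layer instead.

```python
def residual_block(x, sublayer, gamma, beta):
    # "Add & Norm": add the sub-layer output to its input, then normalize
    return layer_norm(x + sublayer(x), gamma, beta)

# Wrap the feed-forward network from the earlier sketch as the sub-layer
out = residual_block(x, lambda h: feed_forward(h, W1, b1, W2, b2), gamma, beta)
print(out.shape)   # same shape as x, which is what allows layers to be stacked
```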

In decoder-only models (like GPT), only the decoder is used with causal (masked) self-attention.

Generative Pre-trained Transformers (GPT)

GPT models are decoder-only transformers trained to predict the next token in an autoregressive fashion:

P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_{<t})

Key features:

  • Trained on large-scale text datasets;
  • Can generate coherent and diverse text;
  • Widely used in applications like chatbots and code generation.
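
The decoding loop implied by this factorization can be sketched as follows; next_token_probs stands in for a trained decoder-only model and is replaced here by a uniform toy distribution, so the output is random rather than meaningful text.

```python
import numpy as np

def generate(next_token_probs, prompt, max_new_tokens, rng):
    # Autoregressive decoding: sample x_t from P(x_t | x_<t), append it, repeat
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)                  # distribution over the vocabulary
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

# Toy stand-in for a trained decoder: a uniform distribution over a 10-token vocabulary
rng = np.random.default_rng(0)
uniform_model = lambda tokens: np.full(10, 0.1)
print(generate(uniform_model, prompt=[1, 2, 3], max_new_tokens=5, rng=rng))
```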

BERT and Masked Language Modeling

BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder. It is trained with masked language modeling (MLM):

  • Random tokens are replaced with a [MASK] token;
  • The model predicts the original token based on the full bidirectional context.

P(x_i \mid x_1, \ldots, x_{i-1}, [\text{MASK}], x_{i+1}, \ldots, x_n)

This makes BERT good at tasks like classification, Q&A, and semantic similarity.
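
A simplified sketch of how MLM training inputs can be prepared. Real BERT also replaces some of the selected tokens with random tokens or leaves them unchanged; the [MASK] id of 103 (bert-base-uncased vocabulary), the illustrative token ids, and the -100 "ignore" label convention (from PyTorch losses) are assumptions for illustration.

```python
import numpy as np

MASK_ID = 103   # assumed id of the [MASK] token (bert-base-uncased vocabulary)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    # Randomly select ~15% of positions; the model must recover the original tokens there
    rng = rng or np.random.default_rng()
    ids = np.array(token_ids)
    selected = rng.random(len(ids)) < mask_prob
    labels = np.where(selected, ids, -100)   # -100 marks positions ignored by the loss
    ids = np.where(selected, MASK_ID, ids)
    return ids, labels

ids, labels = mask_tokens([2023, 2003, 1037, 7953, 6251], rng=np.random.default_rng(1))
print(ids)      # masked input fed to the encoder
print(labels)   # prediction targets only at the masked positions
```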

Transformers and LLMs

Transformers are the backbone of Large Language Models (LLMs) like GPT-3, GPT-4, PaLM, LLaMA, and Claude.

LLMs are trained on massive datasets and can have hundreds of billions of parameters, enabling them to:

  • Understand and generate human language;
  • Perform translation, summarization, Q&A, reasoning;
  • Power chatbots, document analyzers, and coding assistants.

Transformers' scalability and ability to model long-range dependencies make them ideal for these models.

1. What is the primary innovation introduced by transformers?

2. What distinguishes BERT from GPT?

3. Why are transformers ideal for LLMs?
