Transformer-Based Generative Models

Introduction to Transformers and Self-Attention

Transformers are a foundational architecture in modern AI, especially in Natural Language Processing (NLP) and generative modeling. First introduced in the paper "Attention is All You Need" (Vaswani et al., 2017), transformers discard recurrence in favor of a mechanism called self-attention, which allows models to consider all parts of the input sequence at once.

Self-Attention Mechanism

The self-attention mechanism enables the model to weigh the importance of different tokens in a sequence relative to each other. This is done using three matrices derived from the input embeddings:

  • Query (Q);
  • Key (K);
  • Value (V).

The attention output is computed as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • Q, K, and V are matrices derived from the input;
  • d_k is the dimension of the key vectors;
  • softmax converts the similarity scores to probabilities.

This allows each token to attend to every other token and adjust its representation accordingly.
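
Below is a minimal NumPy sketch of the attention formula above. For simplicity it uses the raw embeddings directly as Q, K, and V (in a real transformer they come from learned linear projections), and the toy shapes are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices derived from the input embeddings
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, embedding dimension 8
print(scaled_dot_product_attention(X, X, X).shape)   # (4, 8)
```

Each row of the weight matrix tells how strongly one token attends to every other token in the sequence.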

Transformer Architecture Overview

The transformer model consists of stacked encoder and decoder layers:

  • Encoder converts input into a contextualized latent representation;
  • Decoder generates output tokens using the encoder’s output and prior tokens.

Each layer includes:

  • Multi-Head Self-Attention;
  • Feedforward Neural Networks;
  • Layer Normalization;
  • Residual Connections.

Multi-Head Self-Attention

Instead of computing a single attention function, the transformer uses multiple attention heads. Each head learns to focus on different parts of the sequence.

\text{Multi-Head}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_n)W^O

Where each head is computed as:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Where:

  • W_i^Q, W_i^K, W_i^V are projection matrices for queries, keys, and values;
  • W^O projects the concatenated heads back to the original dimension.
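
A rough sketch of multi-head attention, reusing the scaled_dot_product_attention function from the previous snippet. Splitting the projected matrices column-wise into heads is one simple way to realize the per-head projections; the weight names and toy dimensions are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (seq_len, d_model); all weight matrices: (d_model, d_model)
    d_model = X.shape[-1]
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)          # columns belonging to head i
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    # Concatenate the heads and project back to d_model with W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
X = rng.normal(size=(4, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)   # (4, 8)
```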

Feedforward Neural Networks

Each transformer block includes a position-wise feedforward network applied independently to each position:

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1)W_2 + b_2

  • It consists of two linear layers with a non-linearity (e.g., ReLU) in between;
  • Applies the same transformation across all positions.
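
A minimal sketch of the position-wise feedforward network defined above; the inner dimension d_ff and the random toy weights are illustrative choices, not values from the text.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # The same two linear layers are applied independently at every position
    hidden = np.maximum(0.0, x @ W1 + b1)    # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                        # the inner dimension d_ff is usually larger
x = rng.normal(size=(4, d_model))            # 4 positions
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
```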

Layer Normalization

Layer normalization normalizes the input across the feature dimension (rather than across the batch, as batch normalization does). It stabilizes training and improves convergence:

\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta

Where:

  • μ is the mean of the features;
  • σ is the standard deviation;
  • γ and β are learnable parameters.
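
A small NumPy sketch of layer normalization as defined above; the epsilon term is a common numerical-stability addition that is not shown in the formula.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position over its feature dimension (the last axis)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)        # learnable scale and shift, initialized to identity
out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1), out.std(axis=-1))   # ~0 mean and ~1 std per position
```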

Residual Connections

Residual connections add the input of each sub-layer to its output:

\text{Output} = \text{Layer}(x) + x

  • This helps with gradient flow and enables training of deeper models;
  • Used around both the self-attention and feedforward layers.
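
A brief sketch of the "Add & Norm" pattern, reusing layer_norm, feed_forward, and their toy weights from the sketches above. Normalizing after the residual addition is one common arrangement (post-norm); some models normalize before the sub-layer instead.

```python
def residual_block(x, sublayer, gamma, beta):
    # "Add & Norm": add the sub-layer output to its input, then normalize
    return layer_norm(x + sublayer(x), gamma, beta)

# Wrap the feed-forward network from the earlier sketch as the sub-layer
out = residual_block(x, lambda h: feed_forward(h, W1, b1, W2, b2), gamma, beta)
print(out.shape)   # same shape as x, which is what allows layers to be stacked
```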

In decoder-only models (like GPT), only the decoder is used with causal (masked) self-attention.

Generative Pre-trained Transformers (GPT)

GPT models are decoder-only transformers trained to predict the next token in an autoregressive fashion:

P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_{<t})

Key features:

  • Trained on large-scale text datasets;
  • Can generate coherent and diverse text;
  • Widely used in applications like chatbots and code generation.
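
The decoding loop implied by this factorization can be sketched as follows; next_token_probs stands in for a trained decoder-only model and is replaced here by a uniform toy distribution, so the output is random rather than meaningful text.

```python
import numpy as np

def generate(next_token_probs, prompt, max_new_tokens, rng):
    # Autoregressive decoding: sample x_t from P(x_t | x_<t), append it, repeat
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)                  # distribution over the vocabulary
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

# Toy stand-in for a trained decoder: a uniform distribution over a 10-token vocabulary
rng = np.random.default_rng(0)
uniform_model = lambda tokens: np.full(10, 0.1)
print(generate(uniform_model, prompt=[1, 2, 3], max_new_tokens=5, rng=rng))
```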

BERT and Masked Language Modeling

BERT (Bidirectional Encoder Representations from Transformers) uses only the encoder. It is trained with masked language modeling (MLM):

  • Random tokens are replaced with a [MASK] token;
  • The model predicts the original token based on the full bidirectional context.

P(x_i \mid x_1, \ldots, x_{i-1}, [\text{MASK}], x_{i+1}, \ldots, x_n)

This makes BERT good at tasks like classification, Q&A, and semantic similarity.
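
A simplified sketch of how MLM training inputs can be prepared. Real BERT also replaces some of the selected tokens with random tokens or leaves them unchanged; the [MASK] id of 103 (bert-base-uncased vocabulary), the illustrative token ids, and the -100 "ignore" label convention (from PyTorch losses) are assumptions for illustration.

```python
import numpy as np

MASK_ID = 103   # assumed id of the [MASK] token (bert-base-uncased vocabulary)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    # Randomly select ~15% of positions; the model must recover the original tokens there
    rng = rng or np.random.default_rng()
    ids = np.array(token_ids)
    selected = rng.random(len(ids)) < mask_prob
    labels = np.where(selected, ids, -100)   # -100 marks positions ignored by the loss
    ids = np.where(selected, MASK_ID, ids)
    return ids, labels

ids, labels = mask_tokens([2023, 2003, 1037, 7953, 6251], rng=np.random.default_rng(1))
print(ids)      # masked input fed to the encoder
print(labels)   # prediction targets only at the masked positions
```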

Transformers and LLMs

Transformers are the backbone of Large Language Models (LLMs) like GPT-3, GPT-4, PaLM, LLaMA, and Claude.

LLMs are trained on massive datasets and can have hundreds of billions of parameters, enabling them to:

  • Understand and generate human language;
  • Perform translation, summarization, Q&A, reasoning;
  • Power chatbots, document analyzers, and coding assistants.

Transformers' scalability and ability to model long-range dependencies make them ideal for these models.

1. What is the primary innovation introduced by transformers?

2. What distinguishes BERT from GPT?

3. Why are transformers ideal for LLMs?
