Generative AI
Evaluation Metrics for Generative AI
Evaluating generative models differs from evaluating discriminative models, which rely on accuracy metrics. Since generative models produce many valid outputs, they must be assessed for quality, diversity, and relevance. This section introduces key metrics used in both research and industry to evaluate generative models across perceptual, statistical, and human-centered dimensions.
Evaluation for Image-Based Models (GANs, VAEs, Diffusion)
Perceptual and statistical evaluation methods are commonly applied to image-based generative models. These help measure how realistic, diverse, and well-distributed the generated outputs are compared to real images.
Inception Score (IS)
Quantifies both the clarity and diversity of generated images using the classification confidence of a pretrained Inception model.
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\right]\right)$$

where:
$p(y \mid x)$ is the conditional label distribution for image $x$
$p(y)$ is the marginal class distribution.
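As a rough illustration, the score can be computed directly from the classifier's softmax outputs. The sketch below assumes you have already passed the generated images through a pretrained Inception model and collected its class probabilities in a NumPy array; the function name and input format are illustrative.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute IS from an array of predicted class probabilities.

    probs: (N, C) array of softmax outputs p(y|x) from a pretrained
    Inception classifier, one row per generated image.
    """
    p_y = probs.mean(axis=0, keepdims=True)           # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    kl_per_image = kl.sum(axis=1)                      # D_KL(p(y|x) || p(y))
    return float(np.exp(kl_per_image.mean()))          # exp of the average KL
```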
Fréchet Inception Distance (FID)
Measures the similarity between real and generated image distributions using feature embeddings.
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where:
$\mu_r, \mu_g$ and $\Sigma_r, \Sigma_g$ are the means and covariances of the real and generated feature representations.
$\mathrm{Tr}$ stands for the trace of a matrix, i.e., the sum of its diagonal elements. The trace helps quantify how different the feature distributions are in terms of their spread or shape.
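A minimal sketch of the computation, assuming feature embeddings (e.g., Inception-v3 pool features) have already been extracted for both real and generated images; the function name is illustrative and numerical stabilisation is kept to a minimum.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Compute FID from two (N, D) arrays of Inception feature embeddings."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    covmean = sqrtm(sigma_r @ sigma_g)        # matrix square root of the product
    if np.iscomplexobj(covmean):              # discard tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))
```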
LPIPS (Learned Perceptual Image Patch Similarity)
Measures perceptual similarity between image pairs using distances in deep network feature space; lower scores indicate greater visual similarity.
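In practice, LPIPS is usually computed with the open-source lpips package. The snippet below is a minimal sketch using random placeholder tensors in place of real image batches; it assumes images are provided as NCHW tensors scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS; images are expected as NCHW tensors in [-1, 1]
loss_fn = lpips.LPIPS(net='alex')

img0 = torch.rand(1, 3, 64, 64) * 2 - 1   # placeholder generated image
img1 = torch.rand(1, 3, 64, 64) * 2 - 1   # placeholder reference image

distance = loss_fn(img0, img1)            # lower = more perceptually similar
print(float(distance))
```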
Evaluation for Text-Based Models (Transformers, GPT, BERT)
Language generation models are evaluated for quality, coherence, and relevance through statistical, semantic, and subjective metrics.
BLEU / ROUGE / METEOR
Compare n-gram overlap between generated and reference text.
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where:
$p_n$ is the precision for $n$-grams
$\mathrm{BP}$ is the brevity penalty
$w_n$ are the n-gram weights (typically uniform, $w_n = 1/N$).
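A quick sentence-level BLEU check can be run with NLTK; the tokenised sentences below are illustrative, and smoothing is applied because short texts often have zero higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "is", "on", "the", "mat"]      # generated tokens

# Uniform weights over 1- to 4-grams
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```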
BERTScore
Measures semantic similarity between generated and reference text by computing cosine similarity between contextual token embeddings (e.g., from BERT), aggregated into precision, recall, and F1.
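A minimal sketch using the open-source bert-score package; the candidate and reference sentences are illustrative, and the default English model is assumed.

```python
from bert_score import score  # pip install bert-score

candidates = ["The model declined the job offer politely."]
references = ["The model politely turned down the job offer."]

# Returns per-sentence precision, recall, and F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(float(F1.mean()))
```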
Prompt Fidelity
Measures adherence of output to input prompts, especially in instruction-tuned models.
Manually compare prompts to outputs or use similarity scoring models like CLIP or BERT.
Evaluation for Multimodal Models (e.g., DALL·E, Stable Diffusion)
Multimodal models need to be evaluated for alignment between modalities, such as image and text.
CLIPScore
Computes similarity between image embeddings and textual prompt embeddings.
$$\mathrm{CLIPScore}(I, T) = \cos\big(E_I, E_T\big) = \frac{E_I \cdot E_T}{\lVert E_I \rVert \, \lVert E_T \rVert}$$

where $E_I$ and $E_T$ are the modality-specific (image and text) CLIP embeddings.
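A minimal sketch using the Hugging Face transformers implementation of CLIP; the file name generated.png and the prompt are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")          # hypothetical generated image
prompt = "a watercolor painting of a fox"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the image and prompt embeddings
clip_score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(clip_score)
```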
Prompt-to-Image Fidelity
Measures how well generated images match their conditioning prompts.
Use CLIP or manual annotation to judge visual-textual alignment.
Human Evaluation
Despite advances in automated metrics, human evaluation remains essential for subjective or creative tasks. Many generative outputs, especially in art, storytelling, or design, require human judgment to assess their meaningfulness, originality, and appeal. These methods provide nuanced insights that automated metrics often miss.
A/B Testing and Turing-Style Setups
Ask users to choose the preferred or more realistic-looking output from two options.
Real-World Example: in OpenAI's GPT-3 RLHF pipeline, crowdworkers were shown multiple model completions and asked to rank or select the most helpful or realistic one. This feedback directly shaped reward models for further fine-tuning.
Prompt-to-Output Fidelity
Subjective evaluation of how well the output reflects the given prompt.
Real-World Example: during RLHF training for InstructGPT, annotators rated completions for a prompt like "Write a polite email declining a job offer." Human scores determined which outputs aligned with the user's intent and style.
Rating Scales
Collect ratings on scales (e.g., 1–5) for realism, coherence, or creativity.
Real-World Example: in Anthropic's Claude evaluations, researchers collected 1–5 star ratings on helpfulness, honesty, and harmlessness for generations in dialogue, aiding model alignment goals.
Crowdsourced Evaluation
Use platforms like Amazon Mechanical Turk (MTurk) to gather diverse opinions, and check inter-rater agreement.
Real-World Example: Google used large-scale crowdsourcing to assess LaMDA chatbot quality on dimensions like sensibleness and specificity by aggregating thousands of user judgments.
Use a hybrid of automatic and human-centered evaluations to obtain a fuller understanding of generative model performance. Human insight helps validate metric reliability and identify subtle failure cases not captured by numbers. For critical applications, combining multiple human raters and computing inter-rater reliability (e.g., Cohen’s kappa) can improve robustness.
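For example, agreement between two raters on a 1-5 scale can be checked with scikit-learn; the ratings below are hypothetical, and quadratic weighting is a common choice for ordinal scales.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings from two annotators on the same ten outputs
rater_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
rater_b = [5, 4, 3, 2, 3, 4, 1, 4, 3, 1]

# weights="quadratic" penalises large disagreements more than adjacent ones
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(round(kappa, 3))
```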
Summary
These evaluation strategies are indispensable for iterating on model development and guiding deployment decisions. Combining objective metrics with human feedback helps developers balance realism, creativity, diversity, and alignment with user intent or task requirements. Effective evaluation ensures that generative AI models perform not just technically well, but also align with real-world use cases and human expectations.
1. Which of the following evaluation metrics is primarily used to measure the diversity of generated images in Generative Adversarial Networks (GANs)?
2. What is the primary use of Fréchet Inception Distance (FID) in evaluating generative models?
3. Which metric is commonly used to evaluate the semantic similarity between generated text and reference text?