Evaluation Metrics for Generative AI

Evaluating generative models differs from evaluating discriminative models, which can be scored against known labels with accuracy-style metrics. Since a generative model can produce many equally valid outputs, it must instead be assessed for quality, diversity, and relevance. This section introduces key metrics used in both research and industry to evaluate generative models across perceptual, statistical, and human-centered dimensions.

Evaluation for Image-Based Models (GANs, VAEs, Diffusion)

Perceptual and statistical evaluation methods are commonly applied to image-based generative models. These help measure how realistic, diverse, and well-distributed the generated outputs are compared to real images.

Inception Score (IS)

Quantifies both the clarity and diversity of generated images using the classification confidence of a pretrained Inception model.

\text{IS}=\exp\left(\mathbb{E}_x\left[D_{KL}\big(p(y|x)\,\|\,p(y)\big)\right]\right)

where:

  • p(y|x) is the conditional label distribution for image x;

  • p(y) is the marginal class distribution.
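A minimal NumPy sketch of this computation, assuming you already have an (N, C) array of softmax class probabilities p(y|x) from a pretrained Inception classifier (in practice the score is usually averaged over several splits of the generated set):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an (N, C) array of class probabilities p(y|x),
    e.g. softmax outputs of a pretrained Inception net on N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                          # exp of expected KL

# Toy usage: four "images" over three classes.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90],
                  [0.34, 0.33, 0.33]])
print(inception_score(probs))
```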

Fréchet Inception Distance (FID)

Measures the similarity between real and generated image distributions using feature embeddings.

\text{FID}=\lVert\mu_r-\mu_g\rVert^2+\text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}\right)

where:

  • \mu_r, \mu_g and \Sigma_r, \Sigma_g are the means and covariances of the feature representations of real and generated images;

  • \text{Tr}(\cdot) is the trace of a matrix, i.e. the sum of its diagonal elements; it quantifies how much the two feature distributions differ in spread and shape.
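A small sketch of this formula with NumPy and SciPy, assuming real_feats and gen_feats are (N, D) arrays of Inception feature embeddings extracted beforehand:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """FID between two (N, D) arrays of Inception feature embeddings."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):      # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Toy usage with random "features"; real use extracts them with an Inception model.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(loc=0.1, size=(500, 64))))
```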

LPIPS (Learned Perceptual Image Patch Similarity)

Compares the perceptual similarity of image pairs using deep network feature activations; lower distances indicate more perceptually similar images.
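A short usage sketch with the lpips PyPI package; the random tensors below stand in for real image batches, which must be shaped (N, 3, H, W) and scaled to [-1, 1]:

```python
import torch
import lpips  # pip install lpips

# Pretrained AlexNet backbone; 'vgg' is another common choice.
loss_fn = lpips.LPIPS(net='alex')

# Random tensors in [-1, 1] stand in for two image batches.
img0 = torch.rand(1, 3, 64, 64) * 2 - 1
img1 = torch.rand(1, 3, 64, 64) * 2 - 1

with torch.no_grad():
    distance = loss_fn(img0, img1)   # lower = more perceptually similar
print(distance.item())
```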

Evaluation for Text-Based Models (Transformers, GPT, BERT)

Language generation models are evaluated for quality, coherence, and relevance through statistical, semantic, and subjective metrics.

BLEU / ROUGE / METEOR

Compare n-gram overlap between generated and reference text.

\text{BLEU}=\text{BP} \cdot \exp\left(\sum^N_{n=1}w_n\log p_n\right)

where:

  • p_n is the modified precision for n-grams;

  • w_n are the n-gram weights (typically uniform, w_n = 1/N);

  • \text{BP} is the brevity penalty.
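A toy sentence-level example using NLTK that exercises the formula's ingredients (n-gram precisions, weights, and a brevity penalty); real evaluations typically report corpus-level BLEU, and the smoothing choice here is just one reasonable option:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of tokenized references
candidate = ["the", "cat", "is", "on", "the", "mat"]      # tokenized model output

# Uniform weights w_n = 0.25 over 1- to 4-grams; smoothing keeps the score non-zero
# when some higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(score)
```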

BERTScore

Measures semantic similarity between generated and reference text by computing cosine similarity between contextual token embeddings, aggregated into precision, recall, and F1 scores.
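A brief sketch using the bert-score package; the example sentences are made up:

```python
from bert_score import score  # pip install bert-score

candidates = ["The weather is lovely today."]
references = ["It is a beautiful day outside."]

# Returns per-pair precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(F1.mean().item())
```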

Prompt Fidelity

Measures adherence of output to input prompts, especially in instruction-tuned models.

Note

Manually compare prompts to outputs or use similarity scoring models like CLIP or BERT.
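One lightweight way to automate this, sketched below with sentence-transformers, is to embed the prompt and the output and compare them with cosine similarity; the model name and texts are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice

prompt = "Write a short apology email for missing a meeting."
output = "Dear team, I am sorry I missed today's meeting and will share notes soon."

embeddings = model.encode([prompt, output], convert_to_tensor=True)
fidelity = util.cos_sim(embeddings[0], embeddings[1]).item()   # closer to 1 = higher adherence
print(fidelity)
```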

Evaluation for Multimodal Models (e.g., DALL·E, Stable Diffusion)

Multimodal models need to be evaluated for alignment between modalities, such as image and text.

CLIPScore

Computes similarity between image embeddings and textual prompt embeddings.

\text{CLIPScore}=\cos\left(f_{\text{image}},\ f_{\text{text}}\right)

where f_{\text{image}} and f_{\text{text}} are the modality-specific embeddings produced by CLIP's image and text encoders.
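A sketch of this computation with the Hugging Face transformers CLIP implementation; the image path and prompt are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")   # placeholder path to a generated image
prompt = "a watercolor painting of a lighthouse at sunset"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize the embeddings and take their cosine similarity.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print((img_emb @ txt_emb.T).item())
```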

Prompt-to-Image Fidelity

Measures how well generated images match their conditioning prompts.

Note

Use CLIP or manual annotation to judge visual-textual alignment.

Human Evaluation

Despite advances in automated metrics, human evaluation remains essential for subjective or creative tasks. Many generative outputs, especially in art, storytelling, or design, require human judgment to assess their meaningfulness, originality, and appeal. These methods provide nuanced insights that automated metrics often miss.

A/B Testing and Turing-Style Setups

Ask users to choose preferred or real-looking outputs from two options.

  • Real-World Example: in OpenAI's GPT-3 RLHF pipeline, crowdworkers were shown multiple model completions and asked to rank or select the most helpful or realistic one. This feedback directly shaped reward models for further fine-tuning.

Prompt-to-Output Fidelity

Subjective evaluation of how well the output reflects the given prompt.

  • Real-World Example: during RLHF training for InstructGPT, annotators rated completions for a prompt like "Write a polite email declining a job offer." Human scores determined which outputs aligned with the user's intent and style.

Rating Scales

Collect ratings on scales (e.g., 1–5) for realism, coherence, or creativity.

  • Real-World Example: in Anthropic's Claude evaluations, researchers collected 1–5 star ratings on helpfulness, honesty, and harmlessness for generations in dialogue, aiding model alignment goals.

Crowdsourced Evaluation

Use crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) to gather diverse opinions, and check inter-rater agreement before relying on the results.

  • Real-World Example: Google used large-scale crowdsourcing to assess LaMDA chatbot quality on dimensions like sensibleness and specificity by aggregating thousands of user judgments.

Study More

Use a hybrid of automatic and human-centered evaluations to obtain a fuller understanding of generative model performance. Human insight helps validate metric reliability and identify subtle failure cases not captured by numbers. For critical applications, combining multiple human raters and computing inter-rater reliability (e.g., Cohen’s kappa) can improve robustness.
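For example, a quick inter-rater reliability check with scikit-learn, using hypothetical ratings from two annotators:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings from two annotators on the same ten outputs.
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 4, 3, 3, 5, 2, 4, 2, 5, 4]

# Quadratic weights penalize large disagreements more than adjacent ones,
# which suits ordinal rating scales.
print(cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```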

Summary

These evaluation strategies are indispensable for iterating on model development and guiding deployment decisions. Combining objective metrics with human feedback helps developers balance realism, creativity, diversity, and alignment with user intent or task requirements. Effective evaluation ensures that generative AI models perform not just technically well, but also align with real-world use cases and human expectations.

1. Which of the following evaluation metrics is primarily used to measure the diversity of generated images in Generative Adversarial Networks (GANs)?

2. What is the primary use of Fréchet Inception Distance (FID) in evaluating generative models?

3. Which metric is commonly used to evaluate the semantic similarity between generated text and reference text?
