Generative AI
Evaluation Metrics for Generative AI
Evaluating generative models differs from evaluating discriminative models, which rely on accuracy metrics. Since generative models produce many valid outputs, they must be assessed for quality, diversity, and relevance. This section introduces key metrics used in both research and industry to evaluate generative models across perceptual, statistical, and human-centered dimensions.
Evaluation for Image-Based Models (GANs, VAEs, Diffusion)
Perceptual and statistical evaluation methods are commonly applied to image-based generative models. These help measure how realistic, diverse, and well-distributed the generated outputs are compared to real images.
Inception Score (IS)
Quantifies both the clarity and diversity of generated images using the classification confidence of a pretrained Inception model.
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\right]\right)$$

where:
$p(y \mid x)$ is the conditional label distribution for image $x$
$p(y)$ is the marginal class distribution.
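As a rough illustration, the score can be computed directly from the classifier's softmax outputs. The sketch below assumes you have already passed the generated images through a pretrained Inception model and collected its class probabilities in a NumPy array; the function name and input format are illustrative.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute IS from an array of predicted class probabilities.

    probs: (N, C) array of softmax outputs p(y|x) from a pretrained
    Inception classifier, one row per generated image.
    """
    p_y = probs.mean(axis=0, keepdims=True)           # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))
    kl_per_image = kl.sum(axis=1)                      # D_KL(p(y|x) || p(y))
    return float(np.exp(kl_per_image.mean()))          # exp of the average KL
```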
Fréchet Inception Distance (FID)
Measures the similarity between real and generated image distributions using feature embeddings.
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

where:
$\mu_r, \mu_g$ and $\Sigma_r, \Sigma_g$ are the means and covariances of the real and generated feature representations.
$\mathrm{Tr}$ stands for the trace of a matrix, i.e., the sum of its diagonal elements. The trace helps quantify how different the feature distributions are in terms of their spread or shape.
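A minimal sketch of the computation, assuming feature embeddings (e.g., Inception-v3 pool features) have already been extracted for both real and generated images; the function name is illustrative and numerical stabilisation is kept to a minimum.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Compute FID from two (N, D) arrays of Inception feature embeddings."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    covmean = sqrtm(sigma_r @ sigma_g)        # matrix square root of the product
    if np.iscomplexobj(covmean):              # discard tiny imaginary parts
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))
```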
LPIPS (Learned Perceptual Image Patch Similarity)
Measures perceptual similarity between image pairs using distances in deep network feature space; lower scores indicate greater visual similarity.
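In practice, LPIPS is usually computed with the open-source lpips package. The snippet below is a minimal sketch using random placeholder tensors in place of real image batches; it assumes images are provided as NCHW tensors scaled to [-1, 1].

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS; images are expected as NCHW tensors in [-1, 1]
loss_fn = lpips.LPIPS(net='alex')

img0 = torch.rand(1, 3, 64, 64) * 2 - 1   # placeholder generated image
img1 = torch.rand(1, 3, 64, 64) * 2 - 1   # placeholder reference image

distance = loss_fn(img0, img1)            # lower = more perceptually similar
print(float(distance))
```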
Evaluation for Text-Based Models (Transformers, GPT, BERT)
Language generation models are evaluated for quality, coherence, and relevance through statistical, semantic, and subjective metrics.
BLEU / ROUGE / METEOR
Compare n-gram overlap between generated and reference text.
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where:
$p_n$ is the precision for $n$-grams
$\mathrm{BP}$ is the brevity penalty
$w_n$ are the n-gram weights (typically uniform, $w_n = 1/N$).
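A quick sentence-level BLEU check can be run with NLTK; the tokenised sentences below are illustrative, and smoothing is applied because short texts often have zero higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "is", "on", "the", "mat"]      # generated tokens

# Uniform weights over 1- to 4-grams
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```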
BERTScore
Measures semantic similarity between generated and reference text by computing cosine similarity between contextual token embeddings (e.g., from BERT), aggregated into precision, recall, and F1.
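A minimal sketch using the open-source bert-score package; the candidate and reference sentences are illustrative, and the default English model is assumed.

```python
from bert_score import score  # pip install bert-score

candidates = ["The model declined the job offer politely."]
references = ["The model politely turned down the job offer."]

# Returns per-sentence precision, recall, and F1 tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(float(F1.mean()))
```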
Prompt Fidelity
Measures adherence of output to input prompts, especially in instruction-tuned models.
Manually compare prompts to outputs or use similarity scoring models like CLIP or BERT.
Evaluation for Multimodal Models (e.g., DALL·E, Stable Diffusion)
Multimodal models need to be evaluated for alignment between modalities, such as image and text.
CLIPScore
Computes similarity between image embeddings and textual prompt embeddings.
$$\mathrm{CLIPScore}(I, T) = \cos\big(E_I, E_T\big) = \frac{E_I \cdot E_T}{\lVert E_I \rVert \, \lVert E_T \rVert}$$

where $E_I$ and $E_T$ are the modality-specific (image and text) CLIP embeddings.
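A minimal sketch using the Hugging Face transformers implementation of CLIP; the file name generated.png and the prompt are hypothetical placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")          # hypothetical generated image
prompt = "a watercolor painting of a fox"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the image and prompt embeddings
clip_score = torch.nn.functional.cosine_similarity(image_emb, text_emb).item()
print(clip_score)
```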
Prompt-to-Image Fidelity
Measures how well generated images match their conditioning prompts.
Use CLIP or manual annotation to judge visual-textual alignment.
Human Evaluation
Despite advances in automated metrics, human evaluation remains essential for subjective or creative tasks. Many generative outputs, especially in art, storytelling, or design, require human judgment to assess their meaningfulness, originality, and appeal. These methods provide nuanced insights that automated metrics often miss.
A/B Testing and Turing-Style Setups
Ask users to choose the preferred or more realistic-looking output from two options.
Real-World Example: in OpenAI's GPT-3 RLHF pipeline, crowdworkers were shown multiple model completions and asked to rank or select the most helpful or realistic one. This feedback directly shaped reward models for further fine-tuning.
Prompt-to-Output Fidelity
Subjective evaluation of how well the output reflects the given prompt.
Real-World Example: during RLHF training for InstructGPT, annotators rated completions for a prompt like "Write a polite email declining a job offer." Human scores determined which outputs aligned with the user's intent and style.
Rating Scales
Collect ratings on scales (e.g., 1–5) for realism, coherence, or creativity.
Real-World Example: in Anthropic's Claude evaluations, researchers collected 1–5 star ratings on helpfulness, honesty, and harmlessness for generations in dialogue, aiding model alignment goals.
Crowdsourced Evaluation
Use platforms like Amazon Mechanical Turk (MTurk) to gather diverse opinions, and check inter-rater agreement.
Real-World Example: Google used large-scale crowdsourcing to assess LaMDA chatbot quality on dimensions like sensibleness and specificity by aggregating thousands of user judgments.
Use a hybrid of automatic and human-centered evaluations to obtain a fuller understanding of generative model performance. Human insight helps validate metric reliability and identify subtle failure cases not captured by numbers. For critical applications, combining multiple human raters and computing inter-rater reliability (e.g., Cohen’s kappa) can improve robustness.
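For example, agreement between two raters on a 1-5 scale can be checked with scikit-learn; the ratings below are hypothetical, and quadratic weighting is a common choice for ordinal scales.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings from two annotators on the same ten outputs
rater_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
rater_b = [5, 4, 3, 2, 3, 4, 1, 4, 3, 1]

# weights="quadratic" penalises large disagreements more than adjacent ones
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(round(kappa, 3))
```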
Summary
These evaluation strategies are indispensable for iterating on model development and guiding deployment decisions. Combining objective metrics with human feedback helps developers balance realism, creativity, diversity, and alignment with user intent or task requirements. Effective evaluation ensures that generative AI models perform not just technically well, but also align with real-world use cases and human expectations.
1. Which of the following evaluation metrics is primarily used to measure the diversity of generated images in Generative Adversarial Networks (GANs)?
2. What is the primary use of Fréchet Inception Distance (FID) in evaluating generative models?
3. Which metric is commonly used to evaluate the semantic similarity between generated text and reference text?