Variational Lower Bound (ELBO) and Training Objective | Mathematical Foundations of Diffusion Models

Variational Lower Bound (ELBO) and Training Objective

You have seen how diffusion models gradually add noise to data through a forward process and attempt to reverse this process via a parameterized model. To train these models effectively, you need a principled objective that aligns the learned reverse process with the true underlying data distribution. This objective arises naturally from variational inference, resulting in the Evidence Lower Bound (ELBO).

To derive the ELBO for diffusion models, start by considering the generative process as a Markov chain that transforms noise into data. The model defines a parameterized reverse process, denoted as:

p_\theta(x_0, x_1, \ldots, x_T) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)

where x_T is pure noise. The true data likelihood, pθ(x₀), is intractable because it involves integrating over all possible latent trajectories. Instead, you introduce a variational distribution, the forward process q(x₁, ..., x_T | x₀), which is tractable and known.
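In the common Gaussian (DDPM) setting, the forward process is tractable precisely because it admits a closed-form marginal q(x_t | x₀), so any noised sample can be drawn in one step. A minimal sketch, assuming a linear noise schedule (the specific schedule values below are illustrative, not fixed by the theory):

```python
import numpy as np

# Illustrative linear noise schedule (an assumption; the theory only
# requires that noise accumulates over the T forward steps).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # abar_t = prod of alpha_s for s <= t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) via the closed-form Gaussian marginal
    q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)        # toy "data" vector
x_mid = q_sample(x0, 500, rng)     # partially noised sample
x_T = q_sample(x0, T - 1, rng)     # near-pure noise: signal scale ~ sqrt(abar_T)
```

By the final step the signal coefficient sqrt(abar_T) is close to zero, which is what justifies treating x_T as pure noise in the factorization above.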

The ELBO is constructed as follows:

  • Start from the log-likelihood of the data, log pθ(x₀);

  • Rewrite it as an expectation over the forward process trajectory:

    \log p_\theta(x_0) = \log \int q(x_{1:T} \mid x_0)\, \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\, dx_{1:T}
  • Apply Jensen's inequality to obtain a lower bound:

    \log p_{\theta}(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[ \log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T} \mid x_0)} \right]

This expectation is the ELBO. For diffusion models, the ELBO becomes a sum of KL divergences and expected log-likelihood terms at each diffusion step, reflecting the discrepancy between the true forward process and the learned reverse process.
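The Jensen gap can be checked numerically on a toy one-latent-variable model (a stand-in assumption here, not a diffusion chain): with a deliberately mismatched variational distribution, the Monte Carlo ELBO estimate falls below the exact log-likelihood, and the gap equals the KL divergence between q and the true posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.5  # a single observed data point

def log_normal(v, mean, var):
    # log density of N(mean, var) evaluated at v
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# Toy model: prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1),
# so the exact marginal is p(x) = N(0, 2).
exact_log_px = log_normal(x, 0.0, 2.0)

# Deliberately mismatched variational distribution q(z) = N(1, 1).
z = 1.0 + rng.standard_normal(200_000)
log_q = log_normal(z, 1.0, 1.0)
log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
elbo = np.mean(log_joint - log_q)  # Monte Carlo ELBO estimate

print(elbo <= exact_log_px)  # the lower bound holds
```

Shrinking that gap by improving q (or, in diffusion models, by improving the learned reverse process) is exactly what tightens the bound.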

Next, break down each term in the ELBO and interpret its meaning. The ELBO for diffusion models typically takes the form:

  • A sum over time steps of KL divergences between the forward-process posterior and the learned reverse transition:

    \mathbb{E}_{q}\!\Bigg[ \sum_{t=2}^{T} \mathrm{KL}\!\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_{\theta}(x_{t-1} \mid x_t) \big) \Bigg];
  • A terminal term involving the prior and the last latent variable:

    \mathrm{KL}\!\big( q(x_T \mid x_0) \,\|\, p(x_T) \big);
  • An optional reconstruction term (if the model is designed for it):

    \mathbb{E}_{q(x_1 \mid x_0)}\!\left[ \log p_{\theta}(x_0 \mid x_1) \right].

Each KL term measures how well the model's reverse transition at each step matches the forward-process posterior q(x_{t-1} | x_t, x₀), that is, the distribution the forward process implies for x_{t-1} once both x_t and the original data are known. The terminal KL ensures that the distribution at the final time step matches the noise prior. The reconstruction term (when present) encourages the model to reconstruct the original data from the first noised sample x₁.
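In the Gaussian setting, both distributions in each per-step KL are Gaussians, so the term has a simple closed form rather than requiring sampling. A minimal sketch of that formula (for one dimension):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) per dimension."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Sanity checks: KL of a Gaussian with itself is zero, and with equal
# variances the KL grows quadratically as the reverse-model mean drifts
# away from the forward-posterior mean.
print(gaussian_kl(0.3, 0.2, 0.3, 0.2))   # 0.0
print(gaussian_kl(0.0, 1.0, 0.5, 1.0))   # 0.125 = 0.5**2 / 2
```

The quadratic mean-difference term is the one the model can actually reduce during training; it is what ties the ELBO to a regression-style loss on the reverse mean.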

Maximizing the data likelihood is the ultimate goal when training generative models, as it ensures that the model assigns high probability to real data. However, the exact likelihood is intractable for diffusion models due to the latent trajectory integration. By maximizing the ELBO, you maximize a lower bound on the log-likelihood. As the model improves and the variational approximation becomes tighter, the gap between the ELBO and the true likelihood narrows. Thus, minimizing the negative ELBO is equivalent to maximizing the likelihood up to the looseness of the bound, providing a principled training objective for diffusion models.
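As a sketch of how this objective looks in practice: in the DDPM parameterization, each per-step KL reduces (up to a weighting factor) to a mean-squared error between the true forward-process noise and the noise the model predicts, yielding the widely used simplified loss. The `predict_noise` function below is a hypothetical stand-in; a real model is a neural network conditioned on the timestep.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative schedule
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    # Hypothetical stand-in for a trained network eps_theta(x_t, t).
    return np.zeros_like(x_t)

def simplified_loss(x0, rng):
    """One stochastic estimate of the simplified (reweighted) negative ELBO."""
    t = int(rng.integers(1, T))              # sample a random timestep
    eps = rng.standard_normal(x0.shape)      # true forward-process noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - predict_noise(x_t, t)) ** 2)

rng = np.random.default_rng(0)
loss = simplified_loss(rng.standard_normal(16), rng)
```

Averaging this one-sample estimate over data points and timesteps recovers (a reweighted version of) the negative ELBO, which is why minimizing it trains the reverse process.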
