Score Matching, DSM, and their connection to Diffusion
Score matching is a fundamental technique for learning generative models by estimating the gradient of the log-density — known as the score function — of a data distribution. Instead of directly modeling the probability density, score matching seeks to find a function that closely approximates the gradient of the log-probability with respect to the data. This approach is particularly valuable when the normalizing constant of the distribution is intractable, as is often the case in high-dimensional generative modeling.
Denoising score matching (DSM) builds on this idea by introducing controlled noise into the data and training a model to predict the score of the noisy data distribution. DSM leverages the insight that learning to undo noise is closely related to learning the score function itself. This makes DSM especially well suited for diffusion models, where the generative process is defined by gradually adding noise to data and then learning to reverse this process.
To see how DSM connects to diffusion models, consider the mathematical formulation. In DSM, you first corrupt data samples x₀ with Gaussian noise to obtain noisy samples x̃:

x̃ = x₀ + σ·ε,  where ε ∼ N(0, I)

and σ is the noise standard deviation. The goal is to train a neural network sθ(x̃, σ) to estimate the score function of the noisy data, that is, the gradient of the log-probability density with respect to x̃:
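The corruption step can be sketched in a few lines of NumPy (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def corrupt(x0, sigma, rng=None):
    """Corrupt clean samples x0 with isotropic Gaussian noise of std sigma.

    Returns both the noisy samples x_tilde and the noise eps, since the
    DSM target used later is built from (x_tilde - x0) = sigma * eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    x_tilde = x0 + sigma * eps            # x_tilde = x0 + sigma * eps
    return x_tilde, eps
```

Returning `eps` alongside `x_tilde` is a small convenience: the training target below depends only on the noise that was actually added.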
∇x̃ log qσ(x̃)

The DSM loss function encourages the model to predict the true score by minimizing the expected squared error between the model output and the true score:
L_DSM(θ) = E_{x₀,ε} ‖sθ(x̃, σ) − ∇x̃ log qσ(x̃)‖²

However, the true score ∇x̃ log qσ(x̃) is generally unknown. Fortunately, the key result of denoising score matching is that the marginal score inside the expectation can be replaced by the score of the conditional distribution qσ(x̃ | x₀) without changing the minimizer, and for Gaussian noise this conditional score has a simple closed form:
∇x̃ log qσ(x̃ | x₀) = −(x̃ − x₀)/σ²

This leads to a practical DSM loss:
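This identity follows directly from the Gaussian form of the conditional density; a short derivation:

```latex
q_\sigma(\tilde{x} \mid x_0)
  = \frac{1}{(2\pi\sigma^2)^{d/2}}
    \exp\!\left(-\frac{\lVert \tilde{x} - x_0 \rVert^2}{2\sigma^2}\right)
\;\Longrightarrow\;
\log q_\sigma(\tilde{x} \mid x_0)
  = -\frac{\lVert \tilde{x} - x_0 \rVert^2}{2\sigma^2} + \text{const}
\;\Longrightarrow\;
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x_0)
  = -\frac{\tilde{x} - x_0}{\sigma^2}.
```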
L_DSM(θ) = E_{x₀,ε} ‖sθ(x̃, σ) + (x̃ − x₀)/σ²‖²

Since x̃ − x₀ = σε, the target score is simply −ε/σ, so predicting the score is equivalent, up to a scale factor, to predicting the added noise ε. This is why the loss closely resembles the training objective in diffusion models, where the model is trained to predict either the noise or the original data from a noisy sample. In fact, the diffusion model's training objective can be interpreted as a form of denoising score matching, where the model learns to estimate the score of the data at various noise levels, corresponding to different time steps in the diffusion process.
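A minimal NumPy sketch of the practical DSM loss, using a toy analytic "score model" as a stand-in for the neural network sθ (all names are illustrative; a real model would be a trained network):

```python
import numpy as np

def dsm_loss(score_fn, x0, sigma, rng=None):
    """Monte Carlo estimate of the practical DSM loss:
    E[ || s_theta(x_tilde, sigma) + (x_tilde - x0) / sigma^2 ||^2 ].
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x0.shape)       # eps ~ N(0, I)
    x_tilde = x0 + sigma * eps                # corrupt the data
    target = -(x_tilde - x0) / sigma**2       # conditional score = -eps/sigma
    residual = score_fn(x_tilde, sigma) - target
    return np.mean(np.sum(residual**2, axis=-1))

# Toy "model": if the clean data is standard normal, the noisy data is
# N(0, (1 + sigma^2) I), whose exact score is -x / (1 + sigma^2).
def toy_score(x_tilde, sigma):
    return -x_tilde / (1.0 + sigma**2)
```

Because `toy_score` is the true marginal score for standard-normal data, it attains a lower DSM loss than, say, the constant-zero model, which is exactly the behavior the objective rewards.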
To make this connection more concrete, imagine you have a dataset of images. You add Gaussian noise to each image to create a noisy version. The task is to train a model that, given the noisy image, can predict the direction in which the original image lies—that is, the gradient pointing back to the clean image. This direction is precisely the score function for the noisy data distribution. By learning to estimate this score, the model gains the ability to denoise, which is the core mechanism behind the reverse diffusion process. In diffusion models, this learned score function guides the generation of new samples by iteratively moving noisy data towards high-probability regions of the data distribution, effectively reconstructing realistic images from pure noise.
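One common way to turn a learned score into the iterative sampling procedure described above is (unadjusted) Langevin dynamics. A minimal single-noise-level sketch, with illustrative names and a generic `score_fn` standing in for the trained network:

```python
import numpy as np

def langevin_sample(score_fn, sigma, shape, n_steps=200, step_size=0.01, rng=None):
    """Unadjusted Langevin dynamics: repeatedly nudge samples along the
    learned score (uphill in log-density) with a small noise injection:
        x <- x + step_size * s_theta(x, sigma) + sqrt(2 * step_size) * z
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)            # start from pure noise
    for _ in range(n_steps):
        z = rng.standard_normal(shape)
        x = x + step_size * score_fn(x, sigma) + np.sqrt(2.0 * step_size) * z
    return x
```

Practical score-based samplers anneal σ from large to small across steps; this sketch fixes a single noise level to keep the core update visible.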