
Wasserstein GANs (WGAN)

Understanding how to measure the distance between two probability distributions is central to the success of GANs. The original GAN formulation uses the Jensen-Shannon (JS) divergence, but this can cause problems when the distributions do not overlap, leading to vanishing gradients and unstable training. Wasserstein GANs (WGANs) address these issues by introducing the Wasserstein distance (also called Earth Mover's distance) as a new way to quantify how different the generated data distribution is from the real data distribution. The Wasserstein distance has several advantages:

  • It provides meaningful gradients even when the two distributions have no overlap (see the numerical sketch after this list);
  • It leads to more stable and robust GAN training.
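
A quick numerical illustration of the first point: a minimal sketch assuming SciPy is available (the point-mass distributions and the support grid are illustrative choices, not from the lesson). When two point masses never overlap, the JS divergence stays constant no matter how far apart they are, so it carries no gradient signal, while the Wasserstein distance grows with the separation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

support = np.arange(11)  # discrete support {0, 1, ..., 10}

for theta in [1, 2, 5, 10]:
    p_r = np.zeros(11); p_r[0] = 1.0      # "real" distribution: all mass at x = 0
    p_g = np.zeros(11); p_g[theta] = 1.0  # "generated" distribution: all mass at x = theta

    js = jensenshannon(p_r, p_g, base=2) ** 2  # JS divergence (squared JS distance)
    w = wasserstein_distance(support, support, p_r, p_g)

    print(f"theta={theta:2d}  JS={js:.3f}  Wasserstein={w:.3f}")
```

For every separation, JS stays pinned at 1.0 while the Wasserstein distance equals theta, which is exactly the meaningful-gradient behavior the bullet describes.
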
Definition

Wasserstein loss: the loss function in WGANs is based on the Wasserstein distance, which measures the minimum cost of transporting probability mass to transform one distribution into another.

Definition

Lipschitz constraint: to compute the Wasserstein distance, the discriminator (called the critic in WGANs) must be a 1-Lipschitz function. This is typically enforced by weight clipping or other regularization techniques.
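
For reference, a function $f$ is 1-Lipschitz when its output can change no faster than its input:

$$|f(x_1) - f(x_2)| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2$$

Weight clipping enforces this only approximately; later variants such as WGAN-GP use a gradient penalty instead.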

The mathematical formulation of the Wasserstein distance between the real data distribution $P_r$ and the generated data distribution $P_g$ is:

$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right]$$

Here, $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $P_r$ and $P_g$, and $\|x - y\|$ is the cost of transporting a unit of probability mass from $x$ to $y$. This formulation captures the idea of the minimum effort required to transform one distribution into another, making it a powerful tool for training GANs.
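
Optimizing this infimum over all joint distributions directly is intractable. WGAN instead relies on the Kantorovich-Rubinstein duality, which rewrites the same distance as a supremum over 1-Lipschitz functions $f$; this is the quantity the critic learns to approximate:

$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \left( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)] \right)$$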

Conceptually, WGAN modifies the GAN training process in several key ways. Instead of using a discriminator that outputs probabilities, WGAN uses a critic that scores real and generated samples. The critic is trained to maximize the difference between its average output on real samples and its average output on generated samples. This difference approximates the Wasserstein distance between the two distributions. To ensure the critic is a 1-Lipschitz function, its weights are clipped to a small range after each gradient update. As a result, the generator is trained to minimize the Wasserstein distance, leading to more stable gradients and improved training dynamics compared to the original GAN framework.
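
The loop below is a minimal PyTorch sketch of one WGAN iteration. The tiny MLP critic and generator, the random stand-in batch, and the batch size are illustrative placeholders; RMSprop with a small learning rate and a clip range of 0.01 follow the original WGAN paper.

```python
import torch
from torch import nn

latent_dim, clip_value, n_critic = 100, 0.01, 5

# Illustrative placeholder networks (any critic/generator architecture would do).
critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

real_batch = torch.randn(64, 784)  # stand-in for a batch of real data

# Critic: maximize E[critic(real)] - E[critic(fake)],
# several critic steps per generator step.
for _ in range(n_critic):
    z = torch.randn(64, latent_dim)
    fake = generator(z).detach()  # do not backprop into the generator here
    loss_c = -(critic(real_batch).mean() - critic(fake).mean())  # negate: optimizers minimize
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    for p in critic.parameters():  # enforce the 1-Lipschitz constraint by weight clipping
        p.data.clamp_(-clip_value, clip_value)

# Generator: minimize the critic's estimate of the Wasserstein distance.
z = torch.randn(64, latent_dim)
loss_g = -critic(generator(z)).mean()  # push critic scores on fakes upward
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

Training the critic several steps per generator update keeps its score gap a reasonable estimate of the Wasserstein distance before the generator moves.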

