Geometry of Forgetting

When you train a neural network on multiple tasks in sequence, the network's parameters move through a high-dimensional space as they adapt to each new task. Imagine each set of parameters as a point in this space. During sequential learning, the trajectory of these points reflects how the network's knowledge changes: when learning a new task, the optimizer updates parameters to reduce the loss for that task, often moving away from regions that were optimal for previous tasks. This can cause the network to overwrite solutions it had previously discovered, especially if the new task's requirements are very different from the old ones. As a result, performance on earlier tasks can degrade, even though the network may excel at the most recently learned task.
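
The following is a minimal PyTorch sketch of this effect; the small MLP, the two synthetic regression tasks, and all hyperparameters are illustrative placeholders rather than anything prescribed by this chapter. After the model is trained on task B, its loss on task A typically rises sharply, and the distance the parameters travel in weight space can be measured directly.

```python
# Hypothetical demonstration: sequential training on two synthetic tasks.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two synthetic regression "tasks" defined by different linear mappings.
x_a = torch.randn(256, 10)
y_a = x_a @ torch.randn(10, 1)          # task A
x_b = torch.randn(256, 10)
y_b = x_b @ torch.randn(10, 1)          # task B

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def train(x, y, steps=500, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def evaluate(x, y):
    with torch.no_grad():
        return loss_fn(model(x), y).item()

train(x_a, y_a)                          # learn task A first
theta_a = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
loss_a_before = evaluate(x_a, y_a)

train(x_b, y_b)                          # then learn task B
theta_b = torch.cat([p.detach().flatten().clone() for p in model.parameters()])
loss_a_after = evaluate(x_a, y_a)

print(f"task A loss after training on A: {loss_a_before:.4f}")
print(f"task A loss after training on B: {loss_a_after:.4f}")   # typically much higher
print(f"parameter displacement: {(theta_b - theta_a).norm().item():.2f}")
```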

A core driver of this problem is gradient interference. When you compute gradients for the new task, these gradients indicate how to change the parameters to improve performance on that task. However, the gradients for the old tasks may point in different, even opposing, directions. If the gradients for the new and old tasks are aligned, updates can benefit both; but if they are conflicting, the update that helps the new task can actively harm performance on the old tasks. This destructive interference means that learning new information can erase or corrupt what was learned before, simply due to the geometry of the gradients in parameter space.
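
One common way to quantify this interference is the cosine similarity between the per-task gradients at the current parameters. The sketch below illustrates the idea; the model, loss, and stand-in task data are hypothetical choices made purely for illustration.

```python
# Hypothetical sketch: measure gradient interference between an old and a new task.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

x_old, y_old = torch.randn(128, 10), torch.randn(128, 1)   # stand-in for the old task
x_new, y_new = torch.randn(128, 10), torch.randn(128, 1)   # stand-in for the new task

def flat_grad(x, y):
    """Return the loss gradient for one task, flattened into a single vector."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten().clone() for p in model.parameters()])

g_old = flat_grad(x_old, y_old)
g_new = flat_grad(x_new, y_new)

cos = torch.nn.functional.cosine_similarity(g_old, g_new, dim=0).item()
print(f"gradient cosine similarity: {cos:.3f}")
# cos > 0: the tasks' updates are roughly aligned and can help each other.
# cos < 0: following the new task's gradient moves against the old task's gradient,
#          so a step that helps the new task tends to hurt the old one.
```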

Another geometric phenomenon is representation drift. Neural networks develop internal feature representations that capture abstractions useful for solving tasks. As you train on new tasks, these internal representations can shift, or drift, to accommodate the new data. While this plasticity enables the network to learn new things, it also means that the representations supporting older tasks can be lost or altered. Over time, the features that made the network successful on earlier tasks may no longer exist in the same form, leading to a loss of those capabilities.
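
A simple way to observe drift is to fix a probe batch of inputs, record the hidden-layer activations before and after training on a new task, and compare them. The sketch below does this with a hypothetical two-part network (a feature extractor plus an output head) and synthetic data; none of these choices are specified by the chapter.

```python
# Hypothetical sketch: quantify representation drift on a fixed probe batch.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = nn.Sequential(nn.Linear(10, 32), nn.ReLU())   # feature extractor
head = nn.Linear(32, 1)                                 # task head
loss_fn = nn.MSELoss()

probe_x = torch.randn(64, 10)                 # fixed probe inputs from the old task
with torch.no_grad():
    feats_before = hidden(probe_x)            # snapshot of old representations

# Train the whole network on a new (synthetic) task.
x_new, y_new = torch.randn(256, 10), torch.randn(256, 1)
opt = torch.optim.SGD(list(hidden.parameters()) + list(head.parameters()), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss_fn(head(hidden(x_new)), y_new).backward()
    opt.step()

with torch.no_grad():
    feats_after = hidden(probe_x)

# Per-example cosine similarity between old and new feature vectors (1.0 = no drift).
drift = torch.nn.functional.cosine_similarity(feats_before, feats_after, dim=1)
print(f"mean feature similarity after new-task training: {drift.mean().item():.3f}")
```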

The loss landscape (the surface defined by the loss function as a function of the network parameters) also changes as you train on new tasks. Initially, the optimizer may find a minimum in the loss landscape that works well for the first task. But as you introduce new tasks, the shape of the loss landscape deforms: new valleys and hills appear, and the previous minima can shift or even disappear. This makes it increasingly difficult, or even impossible, for the network to return to parameter settings that worked for earlier tasks, even if you retrain or fine-tune. The evolving geometry of the loss landscape is a fundamental reason why catastrophic forgetting is so persistent and challenging.
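
One way to probe this geometry is to walk in a straight line from the parameters found for the first task to the parameters reached after training on a second task, evaluating the first task's loss along the way. The sketch below assumes the same kind of small synthetic setup as the earlier examples; along the interpolation path the old task's loss usually climbs as the parameters approach the new-task solution.

```python
# Hypothetical sketch: evaluate task A's loss along the line between the task-A
# solution and the parameters reached after subsequently training on task B.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

x_a = torch.randn(256, 10); y_a = x_a @ torch.randn(10, 1)   # old task
x_b = torch.randn(256, 10); y_b = x_b @ torch.randn(10, 1)   # new task

def train(x, y, steps=500, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

train(x_a, y_a)
theta_a = [p.detach().clone() for p in model.parameters()]    # solution for task A
train(x_b, y_b)
theta_b = [p.detach().clone() for p in model.parameters()]    # solution after task B

# Interpolate between the two parameter settings and record task A's loss.
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    with torch.no_grad():
        for p, pa, pb in zip(model.parameters(), theta_a, theta_b):
            p.copy_((1 - alpha) * pa + alpha * pb)
        print(f"alpha={alpha:.2f}  task A loss={loss_fn(model(x_a), y_a).item():.4f}")
```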

Key takeaways from this chapter are:

  • Forgetting is fundamentally a geometric phenomenon in the network's parameter space;
  • Gradient interference and representation drift are two of the most important mechanisms driving catastrophic forgetting;
  • The loss landscape itself evolves with each new task, making it difficult to preserve solutions to old problems as new ones are learned.