
KV-Cache and Autoregressive Inference

The KV-cache, short for key-value cache, is a technique used to accelerate autoregressive inference in attention-based models such as transformers. In autoregressive generation, the model predicts one token at a time, using previously generated tokens as context. Normally, at each step, the attention mechanism recalculates key and value representations for all past tokens, which becomes increasingly expensive as the sequence grows. The KV-cache optimizes this process by storing the computed key and value tensors from previous steps, allowing the model to reuse them instead of recomputing. This dramatically reduces redundant computation during inference, resulting in faster and more efficient generation.
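To make the mechanics concrete, here is a minimal single-head sketch of one decoding step that reuses cached keys and values instead of recomputing them. The function name `attend_with_cache`, the weight matrices `w_q`, `w_k`, `w_v`, and the dictionary-based cache are illustrative assumptions, not the API of any particular library.

```python
import torch

def attend_with_cache(x_t, w_q, w_k, w_v, cache):
    """One decoding step of single-head attention with a KV-cache.

    x_t:   (1, d_model) embedding of the newest token only
    cache: dict holding keys/values from earlier steps (empty on step 0)
    """
    q = x_t @ w_q        # query for the new token
    k_new = x_t @ w_k    # key for the new token
    v_new = x_t @ w_v    # value for the new token

    if "k" in cache:
        # Reuse cached keys/values for past tokens; only the new row is computed.
        k = torch.cat([cache["k"], k_new], dim=0)
        v = torch.cat([cache["v"], v_new], dim=0)
    else:
        k, v = k_new, v_new

    # The cache grows by one row of keys and one row of values per generated token.
    cache["k"], cache["v"] = k, v

    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, cache
```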

The introduction of the KV-cache brings important trade-offs between memory and speed during inference. By caching all previous key and value tensors, the model avoids recalculating them at every new token generation step, which greatly speeds up inference — especially for long sequences. However, this speed comes at the cost of increased memory usage, since the cache must store the key and value representations for every token in the sequence. This can become a bottleneck for very long contexts or on hardware with limited memory. It's crucial to note that these trade-offs are specific to the inference phase. During training, the model typically processes entire sequences in parallel, so there is no need for incremental caching or reuse of past key-value pairs.
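To give a feel for the memory side of the trade-off, the back-of-the-envelope calculation below estimates cache size for a hypothetical decoder configuration; all of the numbers are illustrative assumptions, not measurements of any specific model.

```python
# Rough KV-cache size for a hypothetical model configuration (illustrative numbers).
n_layers, n_heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per_elem = 4096, 1, 2   # fp16 = 2 bytes per element

# Factor of 2 for keys and values, stored at every layer for every cached token.
cache_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"{cache_bytes / 1e9:.1f} GB")  # ~2.1 GB for this configuration
```

The cache grows linearly with sequence length and batch size, which is why long contexts or large batches can exhaust accelerator memory even when the model weights themselves fit comfortably.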

Note

The KV-cache is only relevant during inference because, in training, the model processes full sequences all at once using parallel computation. There is no incremental token-by-token generation, so caching previous key-value tensors is unnecessary.

Inference Workflow

During inference, the model generates output one token at a time. At each step, the KV-cache stores the key and value tensors from all previous tokens, allowing the model to reuse them for efficient computation. This avoids recalculating these tensors, making inference much faster, but requires extra memory to store the cache.
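The decoding loop below sketches how the cache is threaded from step to step, reusing the `attend_with_cache` function from the earlier sketch. The toy embedding table, output projection, and greedy next-token choice are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
embed = torch.randn(vocab, d_model)     # toy embedding table
unembed = torch.randn(d_model, vocab)   # toy output projection

cache = {}
tokens = [0]                            # start from an arbitrary token id
for _ in range(5):
    x_t = embed[tokens[-1]].unsqueeze(0)               # embed only the newest token
    out, cache = attend_with_cache(x_t, w_q, w_k, w_v, cache)
    logits = out @ unembed
    tokens.append(int(logits.argmax()))                # greedy next-token choice
print(tokens)
```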

Training Workflow

In training, the entire sequence is processed in parallel. The model computes all key and value tensors for the whole sequence at once, so there is no need to cache or reuse these tensors incrementally. The KV-cache does not play a role in this process.
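For contrast, a training-style pass computes attention over the whole sequence in one parallel step, using a causal mask so each position only sees earlier tokens; since every key and value is produced in the same pass, there is nothing to cache between steps. This is again a single-head sketch with illustrative names.

```python
import torch

def full_sequence_attention(x, w_q, w_k, w_v):
    """Training-style attention: all tokens processed in one parallel pass, no cache."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # each (seq_len, d_model)
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    # Causal mask: each position attends only to itself and earlier positions.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```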

1. What is the primary benefit of using a KV-cache during inference?

2. Why does the KV-cache not affect the training process?

3. How does the KV-cache change the memory and speed characteristics of autoregressive inference?

