
KV-Cache and Autoregressive Inference

The KV-cache, short for key-value cache, is a technique used to accelerate autoregressive inference in attention-based models such as transformers. In autoregressive generation, the model predicts one token at a time, using previously generated tokens as context. Normally, at each step, the attention mechanism recalculates key and value representations for all past tokens, which becomes increasingly expensive as the sequence grows. The KV-cache optimizes this process by storing the computed key and value tensors from previous steps, allowing the model to reuse them instead of recomputing. This dramatically reduces redundant computation during inference, resulting in faster and more efficient generation.
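To make the mechanics concrete, here is a minimal single-head sketch of one decoding step that reuses cached keys and values instead of recomputing them. The function name `attend_with_cache`, the weight matrices `w_q`, `w_k`, `w_v`, and the dictionary-based cache are illustrative assumptions, not the API of any particular library.

```python
import torch

def attend_with_cache(x_t, w_q, w_k, w_v, cache):
    """One decoding step of single-head attention with a KV-cache.

    x_t:   (1, d_model) embedding of the newest token only
    cache: dict holding keys/values from earlier steps (empty on step 0)
    """
    q = x_t @ w_q        # query for the new token
    k_new = x_t @ w_k    # key for the new token
    v_new = x_t @ w_v    # value for the new token

    if "k" in cache:
        # Reuse cached keys/values for past tokens; only the new row is computed.
        k = torch.cat([cache["k"], k_new], dim=0)
        v = torch.cat([cache["v"], v_new], dim=0)
    else:
        k, v = k_new, v_new

    # The cache grows by one row of keys and one row of values per generated token.
    cache["k"], cache["v"] = k, v

    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, cache
```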

The introduction of the KV-cache brings important trade-offs between memory and speed during inference. By caching all previous key and value tensors, the model avoids recalculating them at every new token generation step, which greatly speeds up inference — especially for long sequences. However, this speed comes at the cost of increased memory usage, since the cache must store the key and value representations for every token in the sequence. This can become a bottleneck for very long contexts or on hardware with limited memory. It's crucial to note that these trade-offs are specific to the inference phase. During training, the model typically processes entire sequences in parallel, so there is no need for incremental caching or reuse of past key-value pairs.
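To give a feel for the memory side of the trade-off, the back-of-the-envelope calculation below estimates cache size for a hypothetical decoder configuration; all of the numbers are illustrative assumptions, not measurements of any specific model.

```python
# Rough KV-cache size for a hypothetical model configuration (illustrative numbers).
n_layers, n_heads, head_dim = 32, 32, 128
seq_len, batch, bytes_per_elem = 4096, 1, 2   # fp16 = 2 bytes per element

# Factor of 2 for keys and values, stored at every layer for every cached token.
cache_bytes = 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"{cache_bytes / 1e9:.1f} GB")  # ~2.1 GB for this configuration
```

The cache grows linearly with sequence length and batch size, which is why long contexts or large batches can exhaust accelerator memory even when the model weights themselves fit comfortably.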

Note

The KV-cache is only relevant during inference because, in training, the model processes full sequences all at once using parallel computation. There is no incremental token-by-token generation, so caching previous key-value tensors is unnecessary.

Inference Workflow

During inference, the model generates output one token at a time. At each step, the KV-cache stores the key and value tensors from all previous tokens, allowing the model to reuse them for efficient computation. This avoids recalculating these tensors, making inference much faster, but requires extra memory to store the cache.
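The decoding loop below sketches how the cache is threaded from step to step, reusing the `attend_with_cache` function from the earlier sketch. The toy embedding table, output projection, and greedy next-token choice are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 100
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
embed = torch.randn(vocab, d_model)     # toy embedding table
unembed = torch.randn(d_model, vocab)   # toy output projection

cache = {}
tokens = [0]                            # start from an arbitrary token id
for _ in range(5):
    x_t = embed[tokens[-1]].unsqueeze(0)               # embed only the newest token
    out, cache = attend_with_cache(x_t, w_q, w_k, w_v, cache)
    logits = out @ unembed
    tokens.append(int(logits.argmax()))                # greedy next-token choice
print(tokens)
```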

Training Workflow

In training, the entire sequence is processed in parallel. The model computes all key and value tensors for the whole sequence at once, so there is no need to cache or reuse these tensors incrementally. The KV-cache does not play a role in this process.
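For contrast, a training-style pass computes attention over the whole sequence in one parallel step, using a causal mask so each position only sees earlier tokens; since every key and value is produced in the same pass, there is nothing to cache between steps. This is again a single-head sketch with illustrative names.

```python
import torch

def full_sequence_attention(x, w_q, w_k, w_v):
    """Training-style attention: all tokens processed in one parallel pass, no cache."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # each (seq_len, d_model)
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    # Causal mask: each position attends only to itself and earlier positions.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```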

1. What is the primary benefit of using a KV-cache during inference?

2. Why does the KV-cache not affect the training process?

3. How does the KV-cache change the memory and speed characteristics of autoregressive inference?

