What Happens Under the Hood

One thing: predict the next token. Then repeat.

Try It: Type a Prompt

Type something, hit Generate, and watch the autoregressive loop in action: each predicted token is fed back in as input for the next prediction.
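The predict-then-repeat loop can be sketched in a few lines of Python. This is only an illustration: `TOY_MODEL`, `forward_pass`, and `generate` are hypothetical names, and the lookup table stands in for a real neural network.

```python
import random

# Toy stand-in for a language model (hypothetical data): maps the last
# token to a probability distribution over possible next tokens.
TOY_MODEL = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}


def forward_pass(tokens):
    """Return a probability distribution for the next token."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=8, greedy=True, seed=0):
    """Predict one token, append it, and repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = forward_pass(tokens)  # one forward pass per new token
        if greedy:
            next_token = max(probs, key=probs.get)
        else:
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the output back in as input
    return tokens


print(generate(["the"], max_new_tokens=3))  # ['the', 'cat', 'sat', 'on']
```

With `greedy=False` the next token is sampled from the distribution instead of always taking the most likely one.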

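Key Concepts

Token: a chunk of text (a word or a piece of a word) that the model reads and writes.
Forward Pass: one run of the current tokens through the model.
Probability Distribution: the model's score for every possible next token; the next token is picked from it.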

Attention: The Engine

How does the model know "it" means "the cat"?

Interactive Attention Map

Hover over "it" to see which words it attends to:

Query · Key · Value

Query

The question each word asks:
"Which words are relevant to me?"

Key

The answer each word offers:
"Here's what I am"

Value

The actual content combined:
"The cat" (enriched)

softmax(Q · Kᵀ / √d) · V

You don't need to follow the math. Just know that Q, K, and V are the three matrices above, and d is the length of each key vector.
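For readers who do want to see the formula run, here is a minimal pure-Python sketch for a single query. The vectors are made-up toy numbers (real models learn them), and `softmax` and `attention` are illustrative helper names, not part of any library.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query scaled dot-product attention: softmax(q · Kᵀ / √d) · V."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)
    ]
    return weights, output


# Toy example: the query for "it" points in the same direction as the
# key for "cat", so "cat" receives most of the attention weight.
keys = [[1.0, 0.0], [0.0, 1.0]]    # keys for "cat", "mat"
values = [[1.0, 0.0], [0.0, 1.0]]  # values for "cat", "mat"
weights, output = attention([2.0, 0.0], keys, values)
print(weights, output)
```

The output is a weighted blend of the value vectors, which is what "enriched" means in the Value card above.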

The Problem: Re-computation

Without a cache, every new token forces the model to re-process ALL previous tokens. Every. Single. Time.

The Quadratic Wall: Click "Add Token"

Counters shown in the demo: current token, tokens re-processed, total ops.

The Insight

Without a cache, token 1 is processed 50 times by the time you reach token 50, and even token 10 is processed 41 times.

The attention pattern between "the cat" and "sat" never changes, yet it is recomputed at every step.

Why recompute what you've already computed? → Cache it.
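The waste is easy to count. The sketch below assumes the simplest possible model of the problem: at step t, an uncached model pushes all t tokens through the network again. Both function names are hypothetical.

```python
def times_processed(token_index, total_tokens):
    """How many steps touch a given token (1-based) when nothing is cached."""
    return total_tokens - token_index + 1


def total_token_visits(total_tokens, cached):
    """Tokens pushed through the model while generating up to total_tokens."""
    if cached:
        return total_tokens                 # each token is processed once
    return sum(range(1, total_tokens + 1))  # step t re-processes t tokens


print(times_processed(1, 50))                # 50
print(times_processed(10, 50))               # 41
print(total_token_visits(50, cached=False))  # 1275
print(total_token_visits(50, cached=True))   # 50
```

The uncached total is n(n+1)/2, which is where the quadratic growth comes from.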

KV Cache: Compute Once, Reuse Forever

O(n²) → O(n) work per generated token. One of the most effective inference optimizations.

Side-by-Side Comparison

Without Cache

With KV Cache

What Gets Cached?

For every token, the model computes a Key vector and a Value vector.

These get stored in the KV Cache:

K₁, V₁ (token 1)
K₂, V₂ (token 2)
... Kₙ, Vₙ (token n)

A new token computes only its own Query, Key, and Value. Its Key and Value are appended to the cache, and its Query attends over all cached K, V pairs. Nothing earlier is recomputed.
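A minimal single-layer, single-head sketch of the idea follows. It is an illustration under simplifying assumptions: `project` is a hypothetical stand-in for the learned projection matrices, and `KVCache` is a toy class, not the data structure any particular inference engine uses.

```python
import math


def project(x, scale):
    """Hypothetical stand-in for a learned projection (W_q, W_k or W_v)."""
    return [scale * xi for xi in x]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]


def step_without_cache(embeddings):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [project(x, 0.5) for x in embeddings]
    values = [project(x, 2.0) for x in embeddings]
    return attend(project(embeddings[-1], 1.0), keys, values)


class KVCache:
    """Keeps every K and V computed so far so they are never recomputed."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, new_embedding):
        # Project only the new token and append its K and V to the cache.
        self.keys.append(project(new_embedding, 0.5))
        self.values.append(project(new_embedding, 2.0))
        # The new token's query attends over everything cached so far.
        return attend(project(new_embedding, 1.0), self.keys, self.values)


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for i, x in enumerate(embeddings):
    cached_out = cache.step(x)
    full_out = step_without_cache(embeddings[: i + 1])
    print(cached_out, full_out)  # same result, far less work with the cache
```

Both paths produce the same output at every step. The cached path does one set of projections per step, while the uncached path redoes them for the whole sequence.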

Complexity Drop

KV Cache in llama.cpp

--ctx-size: how much the model can remember

The Command

$ llama-server --model Qwen3.6-27B-Q4_K_M.gguf \
                --ctx-size 8192 \
                --port 8080

--ctx-size 8192 = 8192 tokens of context (prompt plus generated text) the model can remember.

RAM Calculator: Qwen3.6-27B
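A rough way to estimate KV cache memory is to count one Key and one Value vector per layer, per KV head, per context position. The sketch below assumes a standard transformer where every layer uses ordinary attention and cache entries are 16-bit. The layer and head counts are placeholder numbers chosen for illustration, NOT the real Qwen3.6-27B configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_size, bytes_per_element=2):
    """Keys + values for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_element


# Placeholder architecture numbers (assumptions, not the real model config).
size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_size=8192)
print(f"{size / 2**30:.2f} GiB")  # 2.00 GiB
```

The estimate scales linearly with --ctx-size: doubling the context doubles the cache. The model weights themselves are a separate cost.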

Running Out of Context

When the context is full, the oldest tokens may be dropped to make room (the exact behavior depends on server settings), and the model forgets them. The conversation has amnesia.

Prompt Cache / Prefix Cache

KV Cache = within one generation. Prompt Cache = across requests.

KV Cache vs. Prompt Cache

KV Cache

  • Within one generation
  • Stores K,V for all tokens
  • Grows as conversation grows
  • Always active

Prompt Cache

  • Across multiple requests
  • Caches shared prefixes
  • System prompts, templates
  • Requires repeated prefixes

Visual: Shared Prefix Across 3 Requests
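The shared-prefix idea can be sketched with a toy single-slot cache that remembers the previous request's tokens and only charges for what comes after the common prefix. `PrefixCache` and `shared_prefix_len` are hypothetical names, and tokens are plain words here for readability; real servers compare token IDs.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy single-slot prompt cache: remembers the previous request's tokens."""

    def __init__(self):
        self.cached_tokens = []

    def process(self, tokens):
        """Return (tokens reused from cache, tokens that must be computed)."""
        reused = shared_prefix_len(self.cached_tokens, tokens)
        computed = len(tokens) - reused
        self.cached_tokens = list(tokens)
        return reused, computed


system = ["You", "are", "helpful", "."]
cache = PrefixCache()
print(cache.process(system + ["Hi"]))                 # (0, 5): cold cache
print(cache.process(system + ["Summarize", "this"]))  # (4, 2): prefix reused
print(cache.process(system + ["Translate"]))          # (4, 1): prefix reused
```

Only the first request pays for the system prompt. A prefix that differs even in its first token gets no reuse at all, which is why the shared part has to come first.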

Where It Shines

System Prompts

Same 100-token system prompt × 1,000 requests = roughly 100,000 prompt tokens that never have to be re-processed

Code Templates

Shared boilerplate + varying function bodies

Repeated Instructions

Same instructions, different data each time

Trade-offs & Wrap-up

Speed vs. Memory. Cache hit vs. cache miss.

Interactive Trade-off Map

Three Takeaways

1. Attention is the Engine

Every word looks at every other word. That's the cost.

2. KV Cache is the Shortcut

Compute once, reuse forever. O(n²) → O(n).

3. Prompt Cache is the Multiplier

Share the shortcut across requests.