One thing: predict the next token. Then repeat.
Type something, hit Generate, and watch the autoregressive loop in action.
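The predict-then-repeat loop can be sketched in a few lines. This is a toy illustration only: `BIGRAMS`, `predict_next`, and `generate` are made-up names, and the lookup table stands in for a real model, which would score every token in its vocabulary.

```python
# Toy autoregressive loop. BIGRAMS is a made-up lookup table standing in
# for a real model's next-token prediction.
BIGRAMS = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}

def predict_next(tokens):
    """Hypothetical 'model': looks only at the last token."""
    return BIGRAMS.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens=5):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)  # predict the next token
        if nxt == "<eos>":          # nothing left to predict: stop
            break
        tokens.append(nxt)          # feed it back in, then repeat
    return " ".join(tokens)

print(generate("the"))  # the cat sat on the cat
```

Each pass through the loop sees everything generated so far, which is why the length of the context matters for cost.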
How does the model know "it" means "the cat"?
Hover over "it" to see which words it attends to:
Query (Q), the question each word asks:
"Which words are relevant to me?"
Key (K), the answer each word offers:
"Here's what I am"
Value (V), the actual content combined:
"The cat" (enriched)
softmax(Q · Kᵀ / √d) · V
You don't need to understand the math; just know Q, K, V are the three matrices above.
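For readers who do want to see the formula run, here is a minimal pure-Python sketch of it. The two-dimensional vectors for "the", "cat", and "it" are invented for illustration, and details a real model needs (learned projection matrices, multiple heads, causal masking) are left out.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, K):
    """softmax of (q . k) / sqrt(d) over every key k."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    return softmax(scores)

def attention(Q, K, V):
    """Weighted sum of the value rows, one output row per query."""
    out = []
    for q in Q:
        w = attention_weights(q, K)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Made-up vectors: the query for "it" points in the same direction as the key for "cat".
K = [[1.0, 0.0], [0.0, 1.0]]   # keys for "the", "cat"
V = [[1.0, 0.0], [0.0, 1.0]]   # values for "the", "cat"
Q = [[0.0, 10.0]]              # query for "it"

weights = attention_weights(Q[0], K)
output = attention(Q, K, V)
print(weights)  # almost all of the weight lands on "cat"
print(output)
```

The output row for "it" is therefore almost entirely the value vector of "cat", which is the "enriched" content described above.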
Every new token re-attends to ALL previous tokens. Every. Single. Time.
Tokens 1-10 are reprocessed at every step: up to 50 times each by the time you reach token 50.
The attention pattern between "the cat" and "sat" never changes, yet it's recomputed every step.
Why recompute what you've already computed? Cache it.
O(n²) → O(n) work per new token. One of the biggest inference optimizations.
For every token, the model computes:
These get stored in the KV Cache:
A new token computes only its own Query, Key, and Value, then attends with that Query over the cached K, V pairs. No recomputation.
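The caching idea can be checked with a small single-layer sketch. Everything here is an assumption made for illustration: `project` is a stand-in for the learned K and V projections, the token embedding doubles as the Query, and `KVCache` is a hypothetical class, not any library's API. The sketch counts how many K/V projections each approach performs and confirms both produce the same output.

```python
import math

def project(x, scale):
    """Hypothetical stand-in for a learned K or V projection."""
    return [scale * xi for xi in x]

def attend(q, keys, values):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def step_without_cache(embeddings, counter):
    # Recompute K and V for every token seen so far.
    keys = [project(x, 2.0) for x in embeddings]
    values = [project(x, 3.0) for x in embeddings]
    counter["kv_projections"] += len(embeddings)
    return attend(embeddings[-1], keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, counter):
        # Project only the new token, append it, and attend over the cache.
        self.keys.append(project(x, 2.0))
        self.values.append(project(x, 3.0))
        counter["kv_projections"] += 1
        return attend(x, self.keys, self.values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow, fast = {"kv_projections": 0}, {"kv_projections": 0}
cache = KVCache()
for t in range(1, len(embeddings) + 1):
    a = step_without_cache(embeddings[:t], slow)
    b = cache.step(embeddings[t - 1], fast)
    assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))  # same answer either way

print(slow["kv_projections"], fast["kv_projections"])  # 10 4
```

Over four tokens the uncached path does 1 + 2 + 3 + 4 = 10 projections while the cached path does 4; the gap grows quadratically with sequence length, at the price of keeping every K and V in memory.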
--ctx-size: how much the model can remember
--ctx-size 8192 = 8192 tokens of context the model can remember.
When context is full, the model forgets the oldest tokens. The conversation has amnesia.
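The "forgets the oldest tokens" behaviour can be pictured as a fixed-size window. This is a sketch of the idea only, using a tiny window of 4 in place of 8192; how a given runtime actually handles a full context (shifting, truncating, or refusing the request) varies and is not modelled here.

```python
from collections import deque

def make_context(ctx_size):
    # A deque with maxlen silently drops the oldest item once it is full.
    return deque(maxlen=ctx_size)

context = make_context(ctx_size=4)  # tiny stand-in for --ctx-size 8192
for token in ["my", "name", "is", "Ada", ".", "What", "is", "my", "name", "?"]:
    context.append(token)

print(list(context))  # ['is', 'my', 'name', '?']
```

By the time the question arrives, "Ada" has already fallen out of the window, so the answer is no longer available to the model.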
KV Cache = within one generation. Prompt Cache = across requests.
Same 100-token system prompt × 1000 requests = ~100,000 token computations saved
Shared boilerplate + varying function bodies
Same instructions, different data each time
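A prompt cache can be sketched as a dictionary keyed by the shared prefix. `PromptCache` and its fields are hypothetical names for illustration, and the stored "state" is a placeholder string where a real server would keep the prefix's precomputed K and V.

```python
class PromptCache:
    """Hypothetical prefix cache: maps a prompt prefix to its precomputed state."""

    def __init__(self):
        self.store = {}
        self.tokens_computed = 0
        self.tokens_saved = 0

    def process(self, prefix_tokens, suffix_tokens):
        key = tuple(prefix_tokens)
        if key in self.store:
            # Cache hit: the shared prefix is reused, not reprocessed.
            self.tokens_saved += len(prefix_tokens)
        else:
            # Cache miss: pay for the prefix once and remember it.
            self.store[key] = f"state-for-{len(prefix_tokens)}-tokens"
            self.tokens_computed += len(prefix_tokens)
        # The varying part of the request is always computed.
        self.tokens_computed += len(suffix_tokens)
        return self.store[key]

system_prompt = ["sys"] * 100  # same 100-token system prompt every time
cache = PromptCache()
for i in range(1000):
    cache.process(system_prompt, [f"user-{i}"])

print(cache.tokens_saved)     # 99900
print(cache.tokens_computed)  # 1100
```

The first request is a miss and pays for the prefix; the other 999 are hits, which is where the roughly 100,000 saved token computations come from.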
Speed vs. Memory. Cache hit vs. cache miss.
Every word looks at every other word. That's the cost.
Compute once, reuse forever. O(n²) → O(n).
Share the shortcut across requests.