What Happens Under the Hood

One thing: predict the next token. Then repeat.

🔄 Try It — Type a Prompt

Type something, hit Generate, and watch the autoregressive loop in action: each predicted token is appended to the input and fed back in.
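The loop itself fits in a few lines. Here is a minimal sketch, assuming a hypothetical toy "model" (`TOY_MODEL`, a lookup table invented for this example) in place of a real forward pass. A real LLM conditions on the whole context, not just the last token.

```python
import random

# Hypothetical toy "model": maps the last token to a probability
# distribution over next tokens. Invented for illustration only.
TOY_MODEL = {
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    """Stand-in for a forward pass: context in, probabilities out."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)             # forward pass
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights)[0]   # sample one token
        if token == "<eos>":                               # end of sequence
            break
        tokens.append(token)                               # feed it back in
    return tokens

print(generate(["the"]))
```

Predict, sample, append, repeat. Everything else in this deck is about making that loop cheaper.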

📖 Key Concepts

🔤 Token

⚡ Forward Pass

📊 Probability Distribution

Attention — The Engine

How does the model know "it" means "the cat"?

🔍 Interactive Attention Map

Hover over "it" to see which words it attends to:

🎭 Query · Key · Value

❓

Query

The question each word asks:
"Which words are relevant to me?"

🔑

Key

The answer each word offers:
"Here's what I am"

📦

Value

The actual content combined:
"The cat" (enriched)

softmax(Q · Kᵀ / √d) · V

You don't need to understand the math — just know Q, K, V are the three matrices above.
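If you would rather read code than math, here is the same formula as a plain-Python sketch. It assumes a single attention head, row vectors stored as lists, and no causal mask. The function names are ours, not from any library.

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q · Kᵀ / √d) · V, with Q, K, V given as lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # One score per key: how relevant is that word to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query that matches the first key better than the second.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

The output leans toward the first value vector, because the query scores higher against the first key.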

The Problem — Re-computation

Every new token re-attends to ALL previous tokens. Every. Single. Time.

📈 The Quadratic Wall — Click "Add Token"


💡 The Insight

Tokens 1-10 have each been processed 41 to 50 times by the time you reach token 50.

The keys and values for "the cat" and "sat" never change, yet they are recomputed at every step.

Why recompute what you've already computed? → Cache it.
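To put numbers on the waste, here is a small counting sketch. It assumes the naive scheme where every generation step runs the whole sequence through the model again. The function names are made up for this example.

```python
def times_processed(token_index, current_length):
    """Forward passes that include token `token_index` (1-based) once the
    sequence has grown to `current_length` tokens, with no cache."""
    return max(0, current_length - token_index + 1)

def total_token_passes(n, cached):
    """Total token-level computations needed to grow a sequence to n tokens."""
    if cached:
        return n                       # each token is processed exactly once
    return sum(range(1, n + 1))        # step t re-processes all t tokens

print(times_processed(1, 50))          # → 50
print(times_processed(10, 50))         # → 41
print(total_token_passes(50, cached=False))  # → 1275
print(total_token_passes(50, cached=True))   # → 50
```

Fifty tokens of output cost 1,275 token computations without a cache and 50 with one.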

KV Cache — Compute Once, Reuse Forever

O(n²) → O(n) attention work per generated token. One of the biggest inference optimizations.

⚡ Side-by-Side Comparison

❌ Without Cache

✅ With KV Cache

🧠 What Gets Cached?

For every token, the model computes:

Key vector Value vector

→

These get stored in the KV Cache:

K₁, V₁ (token 1)

K₂, V₂ (token 2)

... Kₙ, Vₙ (token n)

Each new token computes its own Query, Key, and Value. The new K, V pair is appended to the cache, and the Query attends over every cached pair. Earlier tokens are never recomputed.
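Here is a minimal sketch of that idea, assuming a single attention head and tiny hand-picked projection weights (`W_Q`, `W_K`, `W_V` are hypothetical values, not taken from any model). It shows that the cached path and the recompute-everything path give the same answer.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Attention output for one query over all stored keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

# Hypothetical 2x2 projection weights for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.5], [-0.5, 0.5]]
W_V = [[1.0, 2.0], [0.0, 1.0]]

def step_with_cache(x, cache):
    """Process ONE new token embedding x, reusing the cached K, V."""
    cache.append(matvec(W_K, x), matvec(W_V, x))   # computed once, then stored
    return attend(matvec(W_Q, x), cache.keys, cache.values)

def step_without_cache(xs):
    """Recompute K, V for every token in xs, then attend from the last one."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)

cache = KVCache()
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for i, x in enumerate(embeddings):
    cached = step_with_cache(x, cache)
    naive = step_without_cache(embeddings[: i + 1])
    print(cached, naive)   # the two outputs match at every step
```

The cached version does one K projection and one V projection per step. The naive version redoes all of them every time, for the same result.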

📊 Complexity Drop
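A rough worked count, assuming we only tally attention scores (one query-key dot product each) and ignore causal masking, which would roughly halve the uncached figure. At step t the sequence holds t tokens:

```latex
% Attention scores computed at step t
\text{without cache: } t \times t = t^2
\qquad
\text{with cache: } 1 \times t = t

% Summed over a generation of n tokens
\sum_{t=1}^{n} t^2 = \frac{n(n+1)(2n+1)}{6} = O(n^3)
\qquad
\sum_{t=1}^{n} t = \frac{n(n+1)}{2} = O(n^2)
```

So the per-token cost falls from quadratic to linear, and the cost of a whole generation falls from cubic to quadratic.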

KV Cache in llama.cpp

--ctx-size: how much the model can remember

💻 The Command

      $ llama-server --model Qwen3.6-27B-Q4_K_M.gguf \
                      --ctx-size 8192 \
                      --port 8080

--ctx-size 8192 = 8192 tokens of context the model can remember.

🧮 RAM Calculator — Qwen3.6-27B

Context size: 8192 tokens Quantization:
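The calculator boils down to one multiplication. The sketch below assumes a standard transformer where every layer keeps a full K and V per token, and a 16-bit cache. The layer and head counts are PLACEHOLDERS picked for round numbers, not the published Qwen3.6-27B architecture. Check the model card for real values.

```python
def kv_cache_bytes(ctx_size, n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Approximate KV cache size for a standard transformer.

    The factor 2 is one Key vector plus one Value vector per token, per layer.
    bytes_per_element=2 assumes the cache stores 16-bit values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_element

# PLACEHOLDER architecture numbers, for illustration only.
size = kv_cache_bytes(ctx_size=8192, n_layers=64, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")   # → 2.00 GiB
```

Two things to notice: the cache grows linearly with context size, so doubling `--ctx-size` doubles this number, and it comes on top of the memory the model weights already need.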

⚠️ Running Out of Context

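When the context is full, something has to go: the oldest tokens are dropped, or generation stops, depending on how the server is configured. If they are dropped, the conversation has amnesia.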
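The simplest version of that forgetting is a fixed-size window. This is only a sketch of the idea using a standard-library `deque`. Real servers may handle a full context differently, for example by keeping the system prompt or refusing the request.

```python
from collections import deque

def make_context(ctx_size):
    """A fixed-size window: appending past capacity drops the oldest token."""
    return deque(maxlen=ctx_size)

context = make_context(4)
for token in ["my", "name", "is", "Ada", ".", "What"]:
    context.append(token)

print(list(context))   # → ['is', 'Ada', '.', 'What']
```

With room for only four tokens, "my name" is gone by the time the question arrives.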

Prompt Cache / Prefix Cache

KV Cache = within one generation. Prompt Cache = across requests.

🔀 KV Cache vs. Prompt Cache

KV Cache

Within one generation
Stores K,V for all tokens
Grows as conversation grows
Always active

Prompt Cache

Across multiple requests
Caches shared prefixes
System prompts, templates
Requires repeated prefixes

📡 Visual: Shared Prefix Across 3 Requests
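In code, the core of a prompt cache is a longest-common-prefix check. The sketch below assumes a hypothetical single-slot cache (`PrefixCache` is invented for this example) that remembers only the previous request's tokens. Production servers track more state than this.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Hypothetical single-slot prompt cache keyed on token sequences."""

    def __init__(self):
        self.cached_tokens = []

    def process(self, tokens):
        """Return (reused, computed) token counts for this request."""
        reused = common_prefix_len(self.cached_tokens, tokens)
        computed = len(tokens) - reused
        self.cached_tokens = list(tokens)   # the slot now holds this prompt
        return reused, computed

system = ["You", "are", "helpful", "."]
cache = PrefixCache()
print(cache.process(system + ["Hi"]))           # → (0, 5)
print(cache.process(system + ["Bye"]))          # → (4, 1)
print(cache.process(system + ["What", "now"]))  # → (4, 2)
```

The first request pays for everything. After that, only the tokens past the shared prefix need a forward pass.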

🚀 Where It Shines

📋

System Prompts

Same 100-token system prompt × 1,000 requests ≈ 100,000 token computations saved (only the first request pays for the prefix)

💻

Code Templates

Shared boilerplate + varying function bodies

📝

Repeated Instructions

Same instructions, different data each time

Trade-offs & Wrap-up

Speed vs. Memory. Cache hit vs. cache miss.

⚖️ Interactive Trade-off Map


🎯 Three Takeaways

1️⃣

Attention is the Engine

Every token looks at every token before it. That's the cost.

2️⃣

KV Cache is the Shortcut

Compute once, reuse forever. O(n²) → O(n) per new token.

3️⃣

Prompt Cache is the Multiplier

Share the shortcut across requests.