Prompt caching is an essential optimization technique for developers building Large Language Model (LLM) applications. While individual user queries are often unique, professional AI applications typically structure prompts as a "sandwich"—a large block of static information followed by a small, dynamic user query.
By caching the "initial text" (the static part), the model avoids recomputing its internal representation of that text every time a user asks a new question. This leads to faster response times and significantly lower costs for high-traffic AI services.
I. Prompt Caching vs. Output Caching
It is important to distinguish between caching an answer and caching a prompt. Standard output caching stores a specific response to a specific question. If the user changes a single word in their question, the output cache becomes invalid.
Prompt caching is more flexible. It stores the model's "thinking" for the first part of a prompt. This means that even if the final question changes, the model can still use its cached understanding of the background documents or instructions that preceded that question.
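To make the difference concrete, here is a purely schematic sketch (not any real library's API) of how the two strategies are keyed; the dictionaries and function names are illustrative only.

```python
# Schematic only: how the two caching strategies are keyed.
output_cache = {}   # keyed on the ENTIRE prompt, question included
prefix_cache = {}   # keyed on the static prefix only

def output_cache_lookup(full_prompt: str):
    # Changing a single word in the question produces a new key -> cache miss.
    return output_cache.get(full_prompt)

def prompt_cache_lookup(static_prefix: str):
    # The question is not part of the key, so it can change freely.
    return prefix_cache.get(static_prefix)
```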
II. The Pre-fill Phase and KV Pairs
When an LLM processes a prompt, it enters the "pre-fill" phase. During this time, it generates Key-Value (KV) pairs for every token. These pairs act as a mathematical representation of the context, allowing the model to understand how every word relates to every other word in the text.
For a 50-page document, this pre-fill phase requires millions of calculations. Prompt caching stores these pre-computed KV pairs in memory. When a new request arrives that starts with the same text, the model retrieves the "saved" context and begins generating the answer immediately, skipping the expensive recalculation of the initial text.
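The sketch below illustrates prefix reuse at a toy level. `compute_kv` stands in for the real per-token attention computation, and strings stand in for the per-layer tensors an actual inference engine would store; this is an assumption-laden illustration, not how any specific engine is implemented.

```python
# Toy illustration of KV reuse during the pre-fill phase.
# Real engines store per-layer key/value tensors; strings stand in for them here.

def compute_kv(token: str) -> str:
    # Placeholder for the expensive attention computation performed per token.
    return f"kv({token})"

kv_cache: list[tuple[str, str]] = []   # (token, kv_pair) for the cached prefix

def prefill(tokens: list[str]) -> list[str]:
    global kv_cache
    # 1. Count how many leading tokens match the cached prefix.
    matched = 0
    for (cached_token, _), new_token in zip(kv_cache, tokens):
        if cached_token != new_token:
            break
        matched += 1
    # 2. Reuse the cached KV pairs for the matched prefix...
    kv_pairs = [kv for _, kv in kv_cache[:matched]]
    # 3. ...and compute KV pairs only for the new tokens at the end.
    for token in tokens[matched:]:
        kv_pairs.append(compute_kv(token))
    kv_cache = list(zip(tokens, kv_pairs))   # refresh the cache for the next request
    return kv_pairs

# First request pre-fills everything; the second recomputes only the final token.
prefill(["You", "are", "a", "legal", "assistant", ".", "Question", "A"])
prefill(["You", "are", "a", "legal", "assistant", ".", "Question", "B"])
```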
III. The "Sandwich" Architecture: Making Initial Texts Identical
A common critique of prompt caching is that "user requests are fundamentally different." While the user's specific query changes, the architecture of a professional AI prompt is designed to keep the initial text identical across thousands of requests. This is known as the "Sandwich" or "Prefix" model.
In an API call, a prompt is structured like this:
- System Instructions (Static): 500 words of rules.
- Context/Knowledge (Static): A 10,000-word technical manual.
- User Question (Dynamic): The unique 10-word question.
Because the first 10,500 words are identical for every user, the cache stays "warm." The model only has to compute the unique tokens at the very end.
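A minimal sketch of that layering in a chat-style request is shown below. The message roles follow the common chat-API convention, `manual.txt` is a hypothetical knowledge file, and whether the prefix is cached automatically or requires an explicit cache marker depends on the provider.

```python
# Sandwich / prefix structure: static layers first, dynamic question last.
SYSTEM_RULES = "You are a support assistant. Answer only from the manual below."  # ~500 words in practice

with open("manual.txt") as f:          # hypothetical 10,000-word static knowledge block
    TECHNICAL_MANUAL = f.read()

def build_messages(user_question: str) -> list[dict]:
    return [
        # Static prefix: byte-identical for every request, so it is cache-eligible.
        {"role": "system", "content": SYSTEM_RULES + "\n\n" + TECHNICAL_MANUAL},
        # Dynamic suffix: the only part the model must process from scratch.
        {"role": "user", "content": user_question},
    ]

# Every caller reuses the same prefix; only the final message differs.
messages = build_messages("How do I reset the device to factory settings?")
```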
IV. Clarifying High-Impact Use Cases
Prompt caching is not designed for simple one-off questions; it is designed for applications where a massive "Prefix" is reused.
- Document-Based Q&A (RAG): If 1,000 employees are asking questions about a single 300-page corporate policy document, the policy text is the "initial text." Caching it once saves the model from "re-reading" 300 pages for every single employee question.
- System Prompt "Guardrails": Enterprise bots often have massive system prompts defining legal limits and personality. Since every interaction starts with these identical rules, caching the rules ensures the bot stays efficient.
- Few-Shot Task Training: To get a model to output perfect JSON or medical coding, developers provide 10-20 examples of "Input -> Output." These examples are sent with every single request. Caching these examples means the model "learns" the format once and applies it to thousands of different data points.
- Long-Running Conversations: In a technical support chat, the first 20 messages of the history become the "initial text" for the 21st message. The model only processes the new message relative to the cached history, as sketched after this list.
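As a rough sketch of the conversation case (again using the generic chat-message convention, and assuming the provider prefix-caches history it has already seen):

```python
# Multi-turn chat: the growing history is the static prefix for each new turn.
history: list[dict] = [
    {"role": "system", "content": "You are a technical support agent."},
]

def ask(user_message: str) -> list[dict]:
    # Append the new turn AFTER the existing history so every earlier token is
    # byte-identical to the previous request and remains cache-eligible.
    # (In a real loop the assistant's reply would be appended as well.)
    history.append({"role": "user", "content": user_message})
    return list(history)   # the message list that would be sent to the model

# Turn 1: nothing is cached yet; the whole prompt is pre-filled.
ask("My router keeps rebooting.")
# Later turns: everything before the new message matches the cached prefix,
# so only the new message requires fresh pre-fill work.
ask("The firmware update did not help. What should I try next?")
```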
V. Prefix Matching and Prompt Structure
The system uses Prefix Matching to validate the cache. It compares the new prompt against the cache token by token, starting from the very first token. Matching stops at the first token that differs; only the portion before that point is served from cache, and everything after it must be recomputed.
Therefore, you must place all static information at the top and the dynamic question at the bottom:
### CORRECT STRUCTURE (Maximum Savings)
1. [CACHE] "You are a legal assistant..." (System Prompt)
2. [CACHE] [Full Text of the 2024 Tax Code] (Knowledge Base)
3. [NEW] "How does Section 4.2 apply to me?" (User Query)
### INCORRECT STRUCTURE (Zero Savings)
1. [NEW] "How does Section 4.2 apply to me?" (User Query)
2. [CACHE] "You are a legal assistant..." (System Prompt)
3. [CACHE] [Full Text of the 2024 Tax Code] (Knowledge Base)
VI. Constraints and Thresholds
Prompt caching is most effective when the following conditions are met:
- Token Count: Most providers require a prefix of at least 1,024 tokens to trigger caching. Caching very short sentences often costs more in overhead than it saves in processing.
- Time-to-Live (TTL): Caches are generally volatile and may expire after 5–10 minutes of inactivity.
- Exact Token Matching: Even a single extra space or a different punctuation mark at the beginning of your "static" text will cause a cache miss. Consistency in prompt formatting is mandatory; the sketch below shows one way to enforce it.
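A small helper along these lines can help enforce that consistency. The function names, the `policy.txt` file, and the 1,024-token threshold are illustrative, and the word-count check is only a crude stand-in for the provider's real tokenizer:

```python
MIN_CACHEABLE_TOKENS = 1024   # typical provider threshold; check your provider's docs

def normalize_prefix(static_text: str) -> str:
    # Normalize line endings and trim surrounding whitespace so the "static"
    # text is byte-identical on every request.
    return static_text.replace("\r\n", "\n").strip()

def is_worth_caching(static_text: str) -> bool:
    # Crude word-count proxy; real token counts come from the provider's tokenizer.
    return len(static_text.split()) >= MIN_CACHEABLE_TOKENS

with open("policy.txt") as f:          # hypothetical static knowledge file
    prefix = normalize_prefix(f.read())

if not is_worth_caching(prefix):
    print("Prefix is below the caching threshold; it will be processed normally.")
```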