Prompt Caching
Prompt caching is a context-engineering technique that reuses previously processed prompt prefixes to reduce latency and cost for repeated or long-context AI requests.
Links
Website: docs.anthropic.comOverview
Prompt caching allows an AI application to store and reuse the model-side processing of repeated prompt content, such as system instructions, tool definitions, policy text, documentation, codebases, or conversation history prefixes. Instead of sending and fully reprocessing the same large context on every request, the provider can recognize stable prompt segments and charge or execute them more efficiently on subsequent calls.
π‘ What is this?
When you send a message to an AI model, the model has to read all of the text you include: instructions, examples, documents, previous conversation, tool descriptions, and the userβs latest question. If much of that text is the same every time, prompt caching lets the system remember that repeated part so the model does not have to process it from scratch on every request.
βοΈ How it works
Prompt caching works by identifying stable prefix segments of a prompt and storing their processed representation for reuse across compatible future requests. In systems such as Anthropic Claude prompt caching, developers mark cacheable content boundaries in the request, often around large static blocks such as system prompts, tool schemas, documents, or few-shot examples. When a later request contains the same prefix up to the cache breakpoint, the provider can retrieve the cached computation instead of recomputing the entire prefix. This typically reduces time-to-first-token and lowers the effective cost of repeated input tokens.
π― Why it matters
Prompt caching matters because modern AI applications increasingly rely on long context: large system prompts, retrieval-augmented generation, repository-level code context, agent tool definitions, compliance instructions, and multi-turn histories. Without caching, applications repeatedly pay latency and token-processing costs for the same content. Caching makes long-context applications more practical, improves responsiveness, and changes how developers design prompts by encouraging stable reusable prefixes.
π οΈ Practical use cases
- β’Caching long system prompts, behavioral policies, and formatting instructions that are reused across many requests
- β’Caching large documents, knowledge bases, or code files that users repeatedly query in a document QA or coding assistant workflow
- β’Caching tool definitions and agent instructions for applications that make many calls with the same tool schema
- β’Caching few-shot examples in classification, extraction, or structured generation workflows
- β’Caching the stable prefix of a multi-turn conversation while appending only the newest user message
- β’Caching repository context for AI coding assistants that repeatedly answer questions about the same codebase
β When to use
Use prompt caching when your application repeatedly sends large, mostly identical prompt prefixes to the same model or provider, especially when the repeated content is expensive to process, changes infrequently, and is used across many requests. It is particularly useful for long-context prompts, agent systems with stable tools, document analysis sessions, coding assistants, and workflows with large few-shot examples.
β When not to use
Do not prioritize prompt caching when prompts are short, highly dynamic, rarely repeated, or when the repeated content changes on nearly every request. It may also be less useful if your application already uses compact retrieval, if latency is not a concern, if provider caching rules are incompatible with your prompt structure, or if cached content creates privacy, retention, or compliance concerns that your organization cannot accept.
π Advantages
- +Reduces latency for repeated long-context requests
- +Can lower input-token costs for cached prompt sections
- +Improves scalability of applications that rely on large static context
- +Encourages cleaner prompt architecture by separating stable and dynamic content
- +Works well with agents that repeatedly send the same tool definitions or instructions
- +Can make document QA, codebase QA, and long conversation workflows more responsive
π Disadvantages
- βRequires prompt structure discipline so reusable content remains identical across requests
- βProvider-specific behavior and pricing can create portability issues
- βCache misses can occur if small changes are made to the cached prefix
- βMay add implementation complexity around cache breakpoints, prompt ordering, and request construction
- βNot as useful for short or one-off prompts
- βDevelopers may overuse long static context instead of designing more efficient retrieval or summarization strategies
β οΈ Limitations
- β’Typically works best on exact or near-exact repeated prompt prefixes rather than arbitrary reused fragments
- β’Cache lifetimes may be limited and provider-specific
- β’Minimum token thresholds may apply before caching becomes available or cost-effective
- β’Cached content often must appear before dynamic content in the prompt
- β’Changes to system prompts, tool schemas, document text, or ordering may invalidate the cache
- β’Caching does not reduce output-token cost or guarantee better model quality
- β’The feature may only be available on specific models, APIs, or pricing tiers
- β’Privacy, data retention, and compliance details depend on the model providerβs implementation
π Alternatives to consider
π Related concepts to learn
π§ͺ Suggested experiments
- βMeasure latency and cost for a long static system prompt with and without prompt caching enabled
- βBuild a document QA flow where the document is cached once and multiple user questions are asked against it
- βCompare prompt caching against retrieval-augmented generation for a large knowledge base
- βTest how small changes to the cached prefix affect cache hit rates
- βBenchmark an agent workflow with large tool schemas cached versus resent normally
- βExperiment with placing stable content before dynamic user input to maximize cache reuse
- βTrack cache hit rate, time-to-first-token, total latency, and effective input-token cost across real traffic
- βEvaluate whether summarizing old conversation turns plus caching the stable prefix performs better than sending the full conversation every time
πΊοΈ Ecosystem Map: Prompting Context Engineering
Prompt engineering and context management are critical skills for getting the most out of AI coding tools. Effective prompting reduces hallucinations, improves output quality, and enables more complex tasks.
Key Concepts
Emerging Tools
Metadata
prompt-cachingThis data is loaded from the database. Ecosystem context may use the section-level generated map.