Caching Strategies for LLM Applications: Reducing Latency and Cost
Caching is one of the most underused levers for reducing cost and latency in production LLM applications. This article covers prompt caching, semantic caching, and response caching — what each layer does, when to use it, and how to think about invalidation and observability.

Building LLM-powered features into a production product is one thing. Keeping them fast and affordable at scale is another challenge entirely. Inference costs accumulate quickly, and latency — even a few extra seconds — erodes user experience in ways that compound over time. Caching is one of the most practical levers available to engineering teams working with LLMs, and it is underused in most early-stage AI implementations.
This article covers the three main caching layers relevant to LLM applications — prompt caching, semantic caching, and response caching — and offers guidance on what to cache, when to invalidate, and how to think about the trade-offs at each layer.
Why Caching Matters More for LLM Applications Than Conventional APIs
LLM inference is expensive and slow relative to most API calls. A single request to a frontier model can take anywhere from one to thirty seconds depending on context length and output size, and costs are metered per token — meaning long prompts or high-volume usage compounds quickly. Unlike a database query that returns in milliseconds, LLM calls carry significant latency by default.

Caching for LLM applications is the practice of storing and reusing computations — whether at the token, embedding, or response level — to avoid redundant inference work. Done well, it reduces both cost and perceived latency without meaningfully degrading output quality. Done poorly, it serves stale or incorrect responses and erodes trust in the system.
The three layers address different parts of the problem:
| Caching Layer | What It Reuses | Best For |
|---|---|---|
| Prompt caching | Prefilled token computations (KV cache) | Long system prompts, repeated context |
| Semantic caching | Stored responses to semantically equivalent queries | FAQ-style or bounded-domain queries |
| Response caching | Exact-match stored completions | Deterministic, templated outputs |
What Is Prompt Caching?
Prompt caching is a platform-level feature offered by some LLM providers — including Anthropic's Claude API — that allows repeated portions of a prompt to be precomputed and reused across requests. Rather than processing the same system prompt or document context from scratch on every call, the model's key-value (KV) cache stores the intermediate computations for those tokens, and subsequent requests that share the same prefix can skip reprocessing them.
This is particularly valuable in two scenarios: applications with long, stable system prompts (instructions, persona definitions, compliance rules), and retrieval-augmented generation (RAG) pipelines where the same document or knowledge chunk is included in multiple requests.
When Does Prompt Caching Apply?
Prompt caching is most effective when:
- Your system prompt is long and does not change between requests
- You are injecting the same retrieved documents into multiple user queries
- You are running agentic or multi-step workflows where the same context is passed across turns
- You are processing a large document repeatedly with different user questions about it
It is least effective when every prompt is unique — for example, freeform creative generation where the context changes with every call. In those cases, there is nothing stable enough to cache at the token level.
Provider Support
Not all providers support prompt caching in the same way, and it is not always on by default. Engineers should check whether their chosen API exposes cache control headers or similar mechanisms, and structure prompts deliberately — placing stable content at the beginning of the context window and variable content at the end — to maximise cache hit rates.
What Is Semantic Caching?
Semantic caching is the practice of storing LLM responses and retrieving them when a new query is semantically equivalent to a previous one — even if the wording is different. Rather than matching on exact string equality, semantic caching embeds incoming queries into a vector space and checks for close neighbours in a vector store. If a sufficiently similar query has been answered before, the cached response is returned instead of calling the LLM.

This approach addresses a key limitation of exact-match caching for natural language: users rarely ask the same question in exactly the same words. Semantic caching captures intent similarity rather than surface similarity.
Architecture of a Semantic Cache
A typical semantic cache sits between the application layer and the LLM API. The flow works as follows:
- Incoming query is embedded using a fast embedding model
- The embedding is compared against a vector index of previously cached query embeddings
- If a match exceeds a configurable similarity threshold, the cached response is returned
- If no match is found, the query proceeds to the LLM, and the response is stored with its embedding for future use
The similarity threshold is the most sensitive tuning parameter. A threshold set too high (requiring near-identical queries) reduces cache hit rate. A threshold set too low risks returning responses that are accurate for a similar-but-different question — which is worse than a cache miss.
What to Cache Semantically
Semantic caching works well for applications with a bounded query domain — customer support tools, internal knowledge bases, product FAQ assistants, and similar use cases where a finite set of questions recurs in varied phrasing. It works poorly for open-ended or highly personalised applications where each query is genuinely distinct.
Invalidation Considerations
Semantic caches need to be invalidated when the underlying facts change. If your product pricing changes and a cached response contains old pricing, semantic cache hits will serve incorrect information. Invalidation strategies include:
- Time-to-live (TTL): Expire cache entries after a fixed period based on how frequently the underlying information changes
- Source-triggered invalidation: When a source document is updated, invalidate all cached entries whose embedding neighbourhood overlaps with that document's topic
- Manual flushing: For high-stakes domains (legal, medical, financial), prefer conservative TTLs and manual review before cache entries are restored
What Is Response Caching?
Response caching is the simplest layer: store the exact output of an LLM call and return it verbatim when the same exact input is received again. This is standard HTTP caching applied to LLM API responses, and it is effective in narrowly defined situations.
Response caching is appropriate when:
- The prompt is templated and deterministic (e.g. "Summarise this product description: [fixed SKU description]")
- The model temperature is set to zero, making outputs consistent
- The use case tolerates exact repetition — for example, pre-generated report sections, batch-processed summaries, or static content pipelines
It is not appropriate for conversational interfaces, personalised outputs, or any scenario where the user expects a fresh response.
Implementation Patterns
Response caching can be implemented at several levels:
- Application layer: Cache keyed on a hash of the prompt, stored in Redis or a similar in-memory store with TTL controls
- CDN layer: For public-facing LLM-powered endpoints with shared responses (e.g. public FAQ answers), CDN caching can serve responses at edge nodes, reducing both latency and origin load
- API gateway layer: Middleware that intercepts requests before they reach the LLM service, checks the cache, and short-circuits on hits
Combining the Three Layers
In a well-designed LLM application, all three caching layers can coexist and complement each other. A useful mental model is to think of them as a cascade:
- Response cache is checked first (exact match, cheapest lookup)
- Semantic cache is checked second (approximate match, moderate cost)
- Prompt cache operates at the provider level and reduces cost even when the other two layers miss
This layered approach means that the most common queries are served entirely from cache, moderately common queries benefit from semantic matching, and novel queries still benefit from reduced inference cost through prompt caching at the API level.
Cache Invalidation: The Hard Part
Cache invalidation is notoriously difficult in software engineering, and LLM caching introduces additional complexity because the "correctness" of a cached response is harder to verify than a database record. A cached SQL query result is either current or stale. A cached LLM response may be technically accurate but no longer the best answer given updated context, product changes, or shifting user expectations.
Practical principles for LLM cache invalidation:
- Match TTL to knowledge volatility: News or pricing data may need TTLs measured in hours. Policy documents might warrant days or weeks.
- Separate cache stores by risk level: Cache responses to informational queries more aggressively than responses that affect decisions (bookings, purchases, support escalations)
- Log cache hits for audit: In regulated industries, maintain a log of when a user received a cached response versus a live one — this matters for compliance and incident investigation
- Monitor hit rate and staleness separately: A high cache hit rate is not inherently good if TTLs are set too long. Track both metrics.
Agentic and Multi-Step Workflows
Caching becomes more complex — and more valuable — in agentic LLM applications where a single user interaction triggers multiple LLM calls across a workflow. Each hop in the chain carries its own latency and cost, and without caching, these multiply quickly.
In multi-step workflows, prompt caching at the API level is especially impactful because the same system prompt and accumulated conversation context is often passed with every step. Structuring prompts so that stable context appears first — and changes only at the end — maximises the chance of cache reuse across steps.
For semantic and response caching in agentic workflows, be cautious: intermediate results that inform later steps may not be safe to cache if they depend on real-time data (tool call outputs, live search results, current user state). Cache the planning layer where appropriate; be conservative about caching execution results.
If you are building or evaluating agentic AI systems, AI Engineering covers how we approach production-grade agent architecture, including latency management.
Cost and Latency Trade-offs at Each Layer
Rather than citing specific numbers — which vary significantly by provider, model, and workload — it is more useful to understand the directional trade-offs:
| Factor | Response Cache | Semantic Cache | Prompt Cache |
|---|---|---|---|
| Implementation complexity | Low | Medium-High | Low (provider feature) |
| Hit rate for typical apps | Low-Medium | Medium | Medium-High |
| Latency reduction on hit | Very High | High | Moderate |
| Cost reduction on hit | Very High | Very High | Moderate |
| Risk of stale responses | High if TTL not set | Medium (threshold sensitive) | Low (partial reuse) |
| Best suited to | Deterministic, templated | Bounded-domain Q&A | All LLM apps |
The right caching strategy depends on your specific query distribution, acceptable staleness tolerance, and the nature of your domain. There is no universal configuration that works across all LLM applications.
Observability: You Cannot Optimise What You Cannot Measure
Caching infrastructure without observability is a liability. Before tuning any caching layer, instrument your application to capture:
- Cache hit rate per layer (response, semantic, prompt)
- Latency distribution for cached versus uncached requests
- Token usage trends over time, broken down by cached versus live inference
- Similarity score distribution for semantic cache queries (to inform threshold tuning)
- Staleness incidents: cases where a cached response was served after the underlying fact changed
This observability foundation is also what enables confident iteration — you can raise or lower thresholds, adjust TTLs, and restructure prompts knowing whether the change improved outcomes.
Where Caching Fits in a Broader AI Engineering Strategy
Caching is one component of a broader cost and performance strategy for LLM applications. It works alongside — not instead of — model selection (choosing the right model size for each task), prompt engineering (reducing token usage at the source), and infrastructure design (async processing, streaming, request batching).
If your team is still working out the foundational architecture for an LLM-powered product, caching decisions should emerge from that design process rather than be retrofitted later. Our AI Product Strategy service covers this kind of end-to-end design work — from picking the right model and retrieval architecture through to production observability.
For teams that already have something running and want to reduce inference costs or improve response times, a targeted engineering review is usually the fastest path to impact. You can explore how we approach that kind of work through our insights or get in touch directly.
Summary
Caching for LLM applications operates at three levels — prompt, semantic, and response — each addressing different cost and latency drivers. Prompt caching reduces inference cost for repeated context at the API level. Semantic caching handles intent-similar queries across varied phrasing. Response caching handles exact, deterministic repetition. Used together in a layered cascade, they can meaningfully reduce the cost and latency profile of a production LLM system without sacrificing output quality — provided invalidation is handled carefully and observability is in place from the start.
If you are building LLM-powered features and want to get the architecture right from the start — including caching, retrieval, and cost controls — we are happy to help. Reach out to start a conversation with our engineering team.
Chris Kerr
Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.


