23 June 2026Updated 23 June 202611 min read

Caching Strategies for LLM Applications: Reducing Latency and Cost

Caching is one of the most underused levers for reducing cost and latency in production LLM applications. This article covers prompt caching, semantic caching, and response caching — what each layer does, when to use it, and how to think about invalidation and observability.

Building LLM-powered features into a production product is one thing. Keeping them fast and affordable at scale is another challenge entirely. Inference costs accumulate quickly, and latency — even a few extra seconds — erodes user experience in ways that compound over time. Caching is one of the most practical levers available to engineering teams working with LLMs, and it is underused in most early-stage AI implementations.

This article covers the three main caching layers relevant to LLM applications — prompt caching, semantic caching, and response caching — and offers guidance on what to cache, when to invalidate, and how to think about the trade-offs at each layer.

Why Caching Matters More for LLM Applications Than Conventional APIs

LLM inference is expensive and slow relative to most API calls. A single request to a frontier model can take anywhere from one to thirty seconds depending on context length and output size, and costs are metered per token — meaning long prompts or high-volume usage compounds quickly. Unlike a database query that returns in milliseconds, LLM calls carry significant latency by default.

Close-up of a developer's hands typing on a mechanical keyboard, with softly blurred monitor screens displaying terminal output in the background, lit by warm side-angled desk lamp light.

Caching for LLM applications is the practice of storing and reusing computations — whether at the token, embedding, or response level — to avoid redundant inference work. Done well, it reduces both cost and perceived latency without meaningfully degrading output quality. Done poorly, it serves stale or incorrect responses and erodes trust in the system.

The three layers address different parts of the problem:

Caching Layer	What It Reuses	Best For
Prompt caching	Prefilled token computations (KV cache)	Long system prompts, repeated context
Semantic caching	Stored responses to semantically equivalent queries	FAQ-style or bounded-domain queries
Response caching	Exact-match stored completions	Deterministic, templated outputs

What Is Prompt Caching?

Prompt caching is a platform-level feature offered by some LLM providers — including Anthropic's Claude API — that allows repeated portions of a prompt to be precomputed and reused across requests. Rather than processing the same system prompt or document context from scratch on every call, the model's key-value (KV) cache stores the intermediate computations for those tokens, and subsequent requests that share the same prefix can skip reprocessing them.

This is particularly valuable in two scenarios: applications with long, stable system prompts (instructions, persona definitions, compliance rules), and retrieval-augmented generation (RAG) pipelines where the same document or knowledge chunk is included in multiple requests.

When Does Prompt Caching Apply?

Prompt caching is most effective when:

Your system prompt is long and does not change between requests
You are injecting the same retrieved documents into multiple user queries
You are running agentic or multi-step workflows where the same context is passed across turns
You are processing a large document repeatedly with different user questions about it

It is least effective when every prompt is unique — for example, freeform creative generation where the context changes with every call. In those cases, there is nothing stable enough to cache at the token level.

Provider Support

Not all providers support prompt caching in the same way, and it is not always on by default. Engineers should check whether their chosen API exposes cache control headers or similar mechanisms, and structure prompts deliberately — placing stable content at the beginning of the context window and variable content at the end — to maximise cache hit rates.

What Is Semantic Caching?

Semantic caching is the practice of storing LLM responses and retrieving them when a new query is semantically equivalent to a previous one — even if the wording is different. Rather than matching on exact string equality, semantic caching embeds incoming queries into a vector space and checks for close neighbours in a vector store. If a sufficiently similar query has been answered before, the cached response is returned instead of calling the LLM.

A female data engineer stands side-on at a whiteboard in a bright daylit Australian office, drawing a system diagram with marker, caught in a candid mid-action moment with a colleague working at a desk in the soft-focus background.

This approach addresses a key limitation of exact-match caching for natural language: users rarely ask the same question in exactly the same words. Semantic caching captures intent similarity rather than surface similarity.

Architecture of a Semantic Cache

A typical semantic cache sits between the application layer and the LLM API. The flow works as follows:

Incoming query is embedded using a fast embedding model
The embedding is compared against a vector index of previously cached query embeddings
If a match exceeds a configurable similarity threshold, the cached response is returned
If no match is found, the query proceeds to the LLM, and the response is stored with its embedding for future use

The similarity threshold is the most sensitive tuning parameter. A threshold set too high (requiring near-identical queries) reduces cache hit rate. A threshold set too low risks returning responses that are accurate for a similar-but-different question — which is worse than a cache miss.

What to Cache Semantically

Semantic caching works well for applications with a bounded query domain — customer support tools, internal knowledge bases, product FAQ assistants, and similar use cases where a finite set of questions recurs in varied phrasing. It works poorly for open-ended or highly personalised applications where each query is genuinely distinct.

Invalidation Considerations

Semantic caches need to be invalidated when the underlying facts change. If your product pricing changes and a cached response contains old pricing, semantic cache hits will serve incorrect information. Invalidation strategies include:

Time-to-live (TTL): Expire cache entries after a fixed period based on how frequently the underlying information changes
Source-triggered invalidation: When a source document is updated, invalidate all cached entries whose embedding neighbourhood overlaps with that document's topic
Manual flushing: For high-stakes domains (legal, medical, financial), prefer conservative TTLs and manual review before cache entries are restored

What Is Response Caching?

Response caching is the simplest layer: store the exact output of an LLM call and return it verbatim when the same exact input is received again. This is standard HTTP caching applied to LLM API responses, and it is effective in narrowly defined situations.

Response caching is appropriate when:

The prompt is templated and deterministic (e.g. "Summarise this product description: [fixed SKU description]")
The model temperature is set to zero, making outputs consistent
The use case tolerates exact repetition — for example, pre-generated report sections, batch-processed summaries, or static content pipelines

It is not appropriate for conversational interfaces, personalised outputs, or any scenario where the user expects a fresh response.

Implementation Patterns

Response caching can be implemented at several levels:

Application layer: Cache keyed on a hash of the prompt, stored in Redis or a similar in-memory store with TTL controls
CDN layer: For public-facing LLM-powered endpoints with shared responses (e.g. public FAQ answers), CDN caching can serve responses at edge nodes, reducing both latency and origin load
API gateway layer: Middleware that intercepts requests before they reach the LLM service, checks the cache, and short-circuits on hits

Combining the Three Layers

In a well-designed LLM application, all three caching layers can coexist and complement each other. A useful mental model is to think of them as a cascade:

Response cache is checked first (exact match, cheapest lookup)
Semantic cache is checked second (approximate match, moderate cost)
Prompt cache operates at the provider level and reduces cost even when the other two layers miss

This layered approach means that the most common queries are served entirely from cache, moderately common queries benefit from semantic matching, and novel queries still benefit from reduced inference cost through prompt caching at the API level.

Cache Invalidation: The Hard Part

Cache invalidation is notoriously difficult in software engineering, and LLM caching introduces additional complexity because the "correctness" of a cached response is harder to verify than a database record. A cached SQL query result is either current or stale. A cached LLM response may be technically accurate but no longer the best answer given updated context, product changes, or shifting user expectations.

Practical principles for LLM cache invalidation:

Match TTL to knowledge volatility: News or pricing data may need TTLs measured in hours. Policy documents might warrant days or weeks.
Separate cache stores by risk level: Cache responses to informational queries more aggressively than responses that affect decisions (bookings, purchases, support escalations)
Log cache hits for audit: In regulated industries, maintain a log of when a user received a cached response versus a live one — this matters for compliance and incident investigation
Monitor hit rate and staleness separately: A high cache hit rate is not inherently good if TTLs are set too long. Track both metrics.

Agentic and Multi-Step Workflows

Caching becomes more complex — and more valuable — in agentic LLM applications where a single user interaction triggers multiple LLM calls across a workflow. Each hop in the chain carries its own latency and cost, and without caching, these multiply quickly.

In multi-step workflows, prompt caching at the API level is especially impactful because the same system prompt and accumulated conversation context is often passed with every step. Structuring prompts so that stable context appears first — and changes only at the end — maximises the chance of cache reuse across steps.

For semantic and response caching in agentic workflows, be cautious: intermediate results that inform later steps may not be safe to cache if they depend on real-time data (tool call outputs, live search results, current user state). Cache the planning layer where appropriate; be conservative about caching execution results.

If you are building or evaluating agentic AI systems, AI Engineering covers how we approach production-grade agent architecture, including latency management.

Cost and Latency Trade-offs at Each Layer

Rather than citing specific numbers — which vary significantly by provider, model, and workload — it is more useful to understand the directional trade-offs:

Factor	Response Cache	Semantic Cache	Prompt Cache
Implementation complexity	Low	Medium-High	Low (provider feature)
Hit rate for typical apps	Low-Medium	Medium	Medium-High
Latency reduction on hit	Very High	High	Moderate
Cost reduction on hit	Very High	Very High	Moderate
Risk of stale responses	High if TTL not set	Medium (threshold sensitive)	Low (partial reuse)
Best suited to	Deterministic, templated	Bounded-domain Q&A	All LLM apps

The right caching strategy depends on your specific query distribution, acceptable staleness tolerance, and the nature of your domain. There is no universal configuration that works across all LLM applications.

Observability: You Cannot Optimise What You Cannot Measure

Caching infrastructure without observability is a liability. Before tuning any caching layer, instrument your application to capture:

Cache hit rate per layer (response, semantic, prompt)
Latency distribution for cached versus uncached requests
Token usage trends over time, broken down by cached versus live inference
Similarity score distribution for semantic cache queries (to inform threshold tuning)
Staleness incidents: cases where a cached response was served after the underlying fact changed

This observability foundation is also what enables confident iteration — you can raise or lower thresholds, adjust TTLs, and restructure prompts knowing whether the change improved outcomes.

Where Caching Fits in a Broader AI Engineering Strategy

Caching is one component of a broader cost and performance strategy for LLM applications. It works alongside — not instead of — model selection (choosing the right model size for each task), prompt engineering (reducing token usage at the source), and infrastructure design (async processing, streaming, request batching).

If your team is still working out the foundational architecture for an LLM-powered product, caching decisions should emerge from that design process rather than be retrofitted later. Our AI Product Strategy service covers this kind of end-to-end design work — from picking the right model and retrieval architecture through to production observability.

For teams that already have something running and want to reduce inference costs or improve response times, a targeted engineering review is usually the fastest path to impact. You can explore how we approach that kind of work through our insights or get in touch directly.

Summary

Caching for LLM applications operates at three levels — prompt, semantic, and response — each addressing different cost and latency drivers. Prompt caching reduces inference cost for repeated context at the API level. Semantic caching handles intent-similar queries across varied phrasing. Response caching handles exact, deterministic repetition. Used together in a layered cascade, they can meaningfully reduce the cost and latency profile of a production LLM system without sacrificing output quality — provided invalidation is handled carefully and observability is in place from the start.

If you are building LLM-powered features and want to get the architecture right from the start — including caching, retrieval, and cost controls — we are happy to help. Reach out to start a conversation with our engineering team.

LLM engineering AI Cost Optimisation Prompt Caching Semantic Caching AI infrastructure

Chris Kerr

Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.

23 June 2026

Fine-Tuning Small Language Models for Domain-Specific Tasks

Fine-tuning a small language model can outperform a frontier model on narrow tasks — but only when the task, data, and economics actually justify the overhead. This article covers when fine-tuning makes sense, how to prepare data and evaluate properly, and how to honestly assess the cost trade-off against prompting a frontier model API.

10 min readChris Kerr

23 June 2026

Structured Outputs and Function Calling: Making LLMs Reliable

Structured outputs and function calling are the mechanisms that make LLMs viable in production workflows — but reliable implementation requires deliberate schema design, validation layers, and observability from the start. This guide covers how to use both patterns effectively, when to choose each, and the failure modes that catch teams off-guard at scale.

9 min readChris Kerr

22 June 2026

Agentic RAG: When Retrieval Needs Reasoning

Standard RAG works well when one retrieval pass is enough. Agentic RAG is the architecture for problems that require planning, iterative retrieval, and reasoning over results from multiple sources. This post covers the patterns, the platform options, and the real engineering trade-offs.

7 min readChris Kerr

Caching Strategies for LLM Applications: Reducing Latency and Cost

Why Caching Matters More for LLM Applications Than Conventional APIs

What Is Prompt Caching?

When Does Prompt Caching Apply?

Provider Support

What Is Semantic Caching?

Architecture of a Semantic Cache

What to Cache Semantically

Invalidation Considerations

What Is Response Caching?

Implementation Patterns

Combining the Three Layers

Cache Invalidation: The Hard Part

Agentic and Multi-Step Workflows

Cost and Latency Trade-offs at Each Layer

Observability: You Cannot Optimise What You Cannot Measure

Where Caching Fits in a Broader AI Engineering Strategy

Summary

Related posts

Fine-Tuning Small Language Models for Domain-Specific Tasks

Structured Outputs and Function Calling: Making LLMs Reliable

Agentic RAG: When Retrieval Needs Reasoning

Related posts

Fine-Tuning Small Language Models for Domain-Specific Tasks

Structured Outputs and Function Calling: Making LLMs Reliable

Agentic RAG: When Retrieval Needs Reasoning