LLM Evaluation in Production: A Three-Layer Approach
LLM evaluation is one of the least glamorous and most important disciplines in production AI engineering. Shipping a language model feature is relatively straightforward. Knowing whether it still works correctly six weeks later — after a prompt change, a model version bump, or a shift in user behaviour — is the hard part. This post covers why eval-driven development deserves a place alongside test-driven development in your engineering culture, and how three complementary layers — LangSmith, Guardrails AI, and bespoke eval suites — combine to give you real confidence in production LLM systems.
Why Does LLM Evaluation Matter?
LLM evaluation is the practice of systematically measuring whether a language model application behaves correctly, safely, and consistently across the range of inputs it will encounter. Unlike deterministic software where a unit test either passes or fails, LLM outputs are probabilistic and context-sensitive. The same prompt can produce subtly different answers across model versions, temperature settings, or retrieval results — and many of those differences matter for your users.

Without a structured evaluation approach, teams fall into a pattern of manual spot-checking. An engineer runs a handful of prompts, the outputs look reasonable, and the feature ships. This works until it doesn't — and it usually stops working at the worst possible time: after a model provider silently updates a hosted model, after a retrieval index is rebuilt, or after a seemingly minor prompt edit. Regressions in LLM applications are real, they are subtle, and they are invisible without systematic measurement.
Eval-driven development means building your evaluation harness before or alongside the feature itself, not as an afterthought. It treats LLM quality as a first-class engineering concern — one that can block a deployment the same way a failing integration test would.
What Makes LLM Evaluation Different from Traditional Software Testing?
Traditional testing is binary: a function returns the expected value or it does not. LLM evaluation operates on a spectrum. You are assessing dimensions like factual accuracy, relevance, tone, safety, and coherence — none of which map cleanly to a pass/fail assertion.

This creates three structural challenges that are especially pronounced in LLM systems:
- The ground truth problem. For many LLM tasks, there is no single correct answer. A customer support response can be helpful in multiple different ways. Evaluation requires defining what "good" looks like, often through rubrics, human baselines, or reference outputs, before you can measure progress or regression.
- The distribution shift problem. User inputs in production rarely look exactly like the examples you tested with during development. Production eval suites need to be seeded with real traffic, not just synthetic prompts crafted by the team that built the feature.
- The model dependency problem. When your application calls a hosted model API, the model itself is a moving dependency. Providers update models, sometimes transparently, sometimes not. Your application's behaviour is only as stable as the model layer beneath it, which means your eval suite needs to run continuously, not just at deploy time.
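One way to make "good" measurable when there is no single correct answer is a weighted rubric. The sketch below is a minimal illustration in plain Python. The criterion names and weights are assumptions chosen for a support-reply scenario, not a recommended standard, and the per-criterion ratings are assumed to come from a human reviewer or a model grader.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance of this dimension


# Hypothetical rubric for a customer support reply (illustrative values only).
RUBRIC: Sequence[Criterion] = (
    Criterion("factual_accuracy", 0.5),
    Criterion("relevance", 0.3),
    Criterion("tone", 0.2),
)


def rubric_score(ratings: Dict[str, float], rubric: Sequence[Criterion] = RUBRIC) -> float:
    """Combine per-criterion ratings in [0, 1] into one weighted score in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    score = 0.0
    for c in rubric:
        rating = ratings[c.name]  # KeyError if a criterion was not rated
        if not 0.0 <= rating <= 1.0:
            raise ValueError(f"rating for {c.name} must be between 0 and 1")
        score += c.weight * rating
    return score / total_weight
```

A reply rated 1.0 for accuracy, 0.5 for relevance, and 0.0 for tone scores 0.65 under these weights, which gives you a number to track across prompt and model changes instead of a pass/fail verdict.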
LangSmith: Observability and Regression Testing for LLM Pipelines
LangSmith is a platform built by the team behind LangChain that provides tracing, dataset management, and automated evaluation for LLM applications. Its core value in a production setting is not just debugging individual runs — it is building a structured feedback loop from production traces back into your evaluation datasets.
In practice, LangSmith earns its place in a production stack through three capabilities:
- Trace capture and inspection. Every call in your LLM pipeline (retrieval, prompt construction, model inference, output parsing) is logged as a structured trace. When a user reports an unexpected output, you can replay the exact chain of events rather than guessing.
- Dataset curation from production traffic. LangSmith lets you promote real production traces into evaluation datasets. This is the most practical way to address the distribution shift problem: your eval suite grows to reflect what users actually send, not what you imagined they would send.
- Automated evaluation runs. You can define evaluators (rule-based, model-graded, or human-reviewed) and run them against your dataset whenever you push a change. This is the equivalent of a CI pipeline for your LLM application. A prompt refactor that degrades performance on your curated dataset surfaces as a regression before it reaches users.
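The regression check at the heart of an automated evaluation run can be sketched without any platform at all. The example below is plain Python and does not use the LangSmith SDK. It assumes you already have a score per dataset example for the current baseline and for the candidate change, and the function names and the 0.05 tolerance are illustrative.

```python
from statistics import mean
from typing import Dict, List

Scores = Dict[str, float]  # example id -> evaluator score in [0, 1]


def find_regressions(baseline: Scores, candidate: Scores, tolerance: float = 0.05) -> List[str]:
    """Return ids of examples whose score dropped by more than `tolerance`."""
    return sorted(
        ex_id
        for ex_id, base in baseline.items()
        if ex_id in candidate and base - candidate[ex_id] > tolerance
    )


def mean_delta(baseline: Scores, candidate: Scores) -> float:
    """Average score change (candidate minus baseline) over shared examples."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared examples to compare")
    return mean(candidate[i] - baseline[i] for i in shared)


baseline = {"a": 0.9, "b": 0.8, "c": 0.7}
candidate = {"a": 0.9, "b": 0.6, "c": 0.72}
regressed = find_regressions(baseline, candidate)  # only "b" dropped beyond tolerance
```

Reporting per-example regressions alongside the average matters because a small average change can hide a large drop on a handful of important cases.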
LangSmith does not enforce guardrails at inference time — it is primarily an offline and diagnostic tool. For runtime enforcement, you need a complementary layer.
Guardrails AI: Runtime Validation and Output Enforcement
Guardrails AI is an open-source framework that adds validation logic to LLM inputs and outputs at inference time. Where LangSmith tells you what happened, Guardrails AI decides what is allowed to happen.
The framework works by defining a set of validators — checks that run against model outputs before they are returned to the user. These validators can cover a wide range of concerns: detecting personally identifiable information in a response, confirming that structured output matches an expected schema, flagging responses that fall outside a defined topic boundary, or checking that cited sources actually exist in the retrieval context.
The practical value of runtime validation is that it creates a safety net between model inference and your application layer. Rather than relying entirely on prompt engineering to keep outputs well-behaved, you add a programmatic check that can either remediate the output, retry the call, or fail gracefully with a controlled fallback.
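The validate, retry, and fall back pattern can be illustrated in a few lines of plain Python. This is a sketch of the idea only and does not use the Guardrails AI API. The validator functions, the email regex, and the fallback message are hypothetical, and remediation of a failing output is left out to keep the example short.

```python
import re
from typing import Callable, Optional, Sequence

# A validator returns None when the text passes, or a short reason when it fails.
Validator = Callable[[str], Optional[str]]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def no_email_addresses(text: str) -> Optional[str]:
    """Crude PII check: flag anything that looks like an email address."""
    return "contains an email address" if EMAIL_RE.search(text) else None


def max_length(limit: int) -> Validator:
    """Build a validator that rejects outputs longer than `limit` characters."""
    def check(text: str) -> Optional[str]:
        return f"longer than {limit} characters" if len(text) > limit else None
    return check


def guarded_call(
    generate: Callable[[], str],
    validators: Sequence[Validator],
    max_attempts: int = 2,
    fallback: str = "Sorry, I can't help with that right now.",
) -> str:
    """Call `generate` until an output passes every validator, else fall back."""
    for _ in range(max_attempts):
        output = generate()
        failures = [reason for v in validators if (reason := v(output)) is not None]
        if not failures:
            return output
    return fallback
```

Here `generate` stands in for whatever function calls your model. Capping `max_attempts` keeps the added latency and cost bounded, and the fallback ensures the application layer always receives a controlled response.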
Guardrails AI is not a silver bullet. Validators add latency, and poorly configured validators can generate false positives that degrade the user experience. The discipline is in calibrating validators to catch genuine failure modes without over-constraining outputs that are actually fine. Like any production safety layer, it requires tuning against real traffic, not just theoretical edge cases.
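Calibration becomes concrete once you measure a validator against a human-labelled sample of real outputs. The sketch below assumes a validator is any function that returns True when a text passes, and that each sample has been labelled acceptable or not by a reviewer. The example validator and sample data are made up for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Labelled = Tuple[str, bool]  # (model output, reviewer says it is acceptable)


def false_positive_rate(validator: Callable[[str], bool], samples: Iterable[Labelled]) -> float:
    """Share of acceptable outputs that the validator wrongly rejects."""
    acceptable: List[str] = [text for text, ok in samples if ok]
    if not acceptable:
        raise ValueError("need at least one acceptable sample")
    rejected = sum(1 for text in acceptable if not validator(text))
    return rejected / len(acceptable)


def catch_rate(validator: Callable[[str], bool], samples: Iterable[Labelled]) -> float:
    """Share of unacceptable outputs that the validator correctly rejects."""
    unacceptable: List[str] = [text for text, ok in samples if not ok]
    if not unacceptable:
        raise ValueError("need at least one unacceptable sample")
    rejected = sum(1 for text in unacceptable if not validator(text))
    return rejected / len(unacceptable)


# Hypothetical over-strict validator: rejects any mention of refunds.
def no_refund_talk(text: str) -> bool:
    return "refund" not in text.lower()


samples = [
    ("We can offer a refund within 30 days.", True),
    ("Your order has shipped.", True),
    ("Sort out your own refund.", False),
    ("Thanks for waiting.", True),
    ("No refund is due on this order.", True),
]
```

On this sample the validator catches the one bad reply but also rejects half of the acceptable ones, which is the kind of trade-off you want to see before enabling a validator in production.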
Guardrails AI enforces output quality and safety constraints; it does not measure broader model performance. A response can pass all your validators and still be factually wrong or unhelpful, which is why runtime validation works as a complement to offline evaluation, not a replacement for it.
Custom Eval Suites: Building Domain-Specific Quality Benchmarks
Generalist evaluation tooling gets you a long way, but most production LLM applications have domain-specific quality requirements that off-the-shelf evaluators cannot fully capture. A legal document summarisation tool has different correctness criteria than a customer support chatbot or a code review assistant. Custom eval suites are how you encode that domain knowledge into your evaluation pipeline.
A well-constructed custom eval suite typically has three components:
- A curated dataset of representative inputs. This should be drawn from real production traffic wherever possible, supplemented with synthetic edge cases that cover known failure modes. The dataset should be treated as a living artefact, reviewed and extended as user behaviour evolves.
- A set of domain-specific evaluators. These might be rule-based checks (does the output contain a required disclosure?), model-graded assessments (does a second LLM judge the response as accurate given the retrieved context?), or human-reviewed samples on a cadence. The right mix depends on the stakes of the application and the cost of each evaluation method.
- A scoring framework that connects to your deployment pipeline. Evaluation scores are only useful if they influence decisions. A custom eval suite that runs in isolation but does not gate deployments or surface trends to the team has limited practical value. The output of every eval run should feed into a dashboard that engineering and product leadership can act on.
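The three components can be wired together in a small amount of code. The sketch below is a minimal illustration in plain Python, assuming model outputs have already been captured for each dataset case. The evaluator names, the disclosure wording, the length limit, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    output: str  # model output captured for this prompt


Evaluator = Callable[[EvalCase], bool]


def has_disclosure(case: EvalCase) -> bool:
    """Rule-based check: the reply must carry a required disclosure."""
    return "general information only" in case.output.lower()


def within_length(case: EvalCase) -> bool:
    """Rule-based check: the reply must stay under 600 characters."""
    return len(case.output) <= 600


def run_suite(cases: Sequence[EvalCase], evaluators: Dict[str, Evaluator]) -> Dict[str, float]:
    """Return the pass rate per evaluator across all cases."""
    if not cases:
        raise ValueError("eval dataset is empty")
    return {
        name: sum(1 for c in cases if fn(c)) / len(cases)
        for name, fn in evaluators.items()
    }


def gate(pass_rates: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return evaluators below their threshold; an empty list means safe to deploy."""
    return sorted(
        name for name, minimum in thresholds.items()
        if pass_rates.get(name, 0.0) < minimum
    )


cases = [
    EvalCase("c1", "Can I break my lease?", "This is general information only. It depends on your agreement."),
    EvalCase("c2", "Is this clause enforceable?", "Yes, it is enforceable."),
]
rates = run_suite(cases, {"has_disclosure": has_disclosure, "within_length": within_length})
failures = gate(rates, {"has_disclosure": 0.9, "within_length": 1.0})
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so the deployment is blocked, while `rates` is the value you would send to a dashboard to track trends over time.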
Building a custom eval suite takes time upfront, but the investment compounds. Each production incident that you capture as a test case makes the next deployment safer. Teams that invest in eval infrastructure early tend to move faster over time, because they can make changes with confidence rather than caution.
Putting the Three Layers Together
These three tools address different points in the LLM application lifecycle and work best in combination:
- LangSmith provides the observability and regression testing layer: it captures what happens in production, curates datasets from real traffic, and runs automated evaluations when you push changes.
- Guardrails AI provides the runtime enforcement layer: it validates outputs at inference time and acts as a safety net between the model and your users.
- Custom eval suites provide the domain-specific quality layer: they encode what "correct" means for your particular application and ensure that a pass on generic checks in one tool does not mask failures that matter to your users.
None of these layers removes the fundamental challenge of LLM evaluation: there is no clean ground truth, and your inputs are always shifting. But together they give you a systematic way to detect regressions, enforce safety constraints, and maintain quality standards as your application and its underlying models evolve.
Production AI is genuinely hard. The teams that do it well treat evaluation as an engineering discipline with the same rigour they apply to testing, monitoring, and incident response — not as a checkbox exercise before launch.
Working with an AI Engineering Partner
Building robust LLM evaluation infrastructure requires experience with the full lifecycle of production AI — from prompt design through to observability and continuous improvement. Our AI engineering practice works with Australian engineering teams to design and implement evaluation pipelines that reflect the specific risks and quality requirements of each application.
If you are building LLM-powered features and want to establish a rigorous evaluation foundation, we are happy to talk through your current approach and where the gaps might be. You can also explore our AI product strategy and application modernisation capabilities, or browse more insights from the Horizon Labs team.
Get in touch to start a conversation about your AI engineering challenges.
Chris Kerr
Founder of Horizon Labs. Twenty years building production software for Australian mid-market businesses, the last seven focused on putting AI into systems that operate at 3am without anyone watching. Writes about strategy, fractional CTO work, and the operational discipline that separates AI demos from AI products.

