LangSmith
Observability is the discipline that separates an AI prototype from an AI product, and LangSmith is the tool that gets us there fastest. Every LLM call is traced with full input/output, intermediate steps, latency, and cost. When a user reports a bad answer, we replay the exact trace and see what the model saw. When we ship a prompt change, we run it through LangSmith's evaluation suite against a curated test set before it touches production. We use LangSmith on every project where production reliability matters — usually paired with our own metrics and alerting layer for the business-critical signals.
Real examples
Production debugging for a poorly-answered RAG query
Illustrative scenario: a client reports that the AI got an answer wrong. We pull the trace from LangSmith and see the original query, the retrieval results (correct documents, wrong ranking), the prompt the LLM received (truncated by a context overflow), and the output. Root cause identified in minutes: the chunking strategy needs to change. Before LangSmith we'd have spent hours reproducing the failure.
Evaluation before shipping a prompt change
Illustrative scenario: we want to swap the system prompt for our content review agent. We run the new prompt through LangSmith's evaluation against 50 historical reviews and measure two things: did the AI flag the same issues, and did it produce fewer false positives? We decide based on the data before shipping.
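The comparison above can be sketched in plain Python. This is a minimal illustration, not LangSmith's evaluation API: the `ReviewCase` schema, the issue labels, and the sample data are all hypothetical, and a real run would pull the flagged issues from the two prompts' outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewCase:
    """One historical review plus the issues a human confirmed (hypothetical schema)."""
    review_id: str
    true_issues: frozenset


def score_prompt(cases, flagged_by_prompt):
    """Count true positives, false positives and missed issues for one prompt."""
    tp = fp = missed = 0
    for case in cases:
        flagged = flagged_by_prompt.get(case.review_id, frozenset())
        tp += len(flagged & case.true_issues)      # issues correctly flagged
        fp += len(flagged - case.true_issues)      # flagged, but not real issues
        missed += len(case.true_issues - flagged)  # real issues not flagged
    return {"true_positives": tp, "false_positives": fp, "missed": missed}


# Made-up sample data standing in for the historical reviews.
cases = [
    ReviewCase("r1", frozenset({"broken-link", "tone"})),
    ReviewCase("r2", frozenset({"factual-error"})),
]
old_flags = {
    "r1": frozenset({"broken-link", "tone", "length"}),
    "r2": frozenset({"factual-error", "tone"}),
}
new_flags = {
    "r1": frozenset({"broken-link", "tone"}),
    "r2": frozenset({"factual-error"}),
}

old_score = score_prompt(cases, old_flags)
new_score = score_prompt(cases, new_flags)
print("old:", old_score)
print("new:", new_score)
```

In this made-up data the new prompt flags the same real issues with fewer false positives, which is the signal we would want before shipping.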
Agent workflow tracing in production
Illustrative scenario: a multi-step agent (extract, validate, route, execute) sometimes fails halfway through. LangSmith traces show which step failed, what state was passed in, and what the LLM decided. Workflow bugs that would have taken weeks to find via logs alone are identified in days.
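To make the idea concrete, here is a small stand-in for per-step tracing, written with the standard library only. It is a sketch of the concept rather than how LangSmith records runs; the step functions, the `StepRecord` shape, and the failure in `validate` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepRecord:
    """What we want to know about each step after the fact (hypothetical shape)."""
    name: str
    state_in: dict
    ok: bool
    error: Optional[str] = None


def run_workflow(steps, state):
    """Run steps in order, recording the state each one received and its outcome."""
    records = []
    for name, fn in steps:
        snapshot = dict(state)  # shallow copy of the state this step saw
        try:
            state = fn(state)
        except Exception as exc:
            records.append(StepRecord(name, snapshot, ok=False, error=repr(exc)))
            break  # later steps never run, so they leave no record
        records.append(StepRecord(name, snapshot, ok=True))
    return records


def extract(state):
    # Hypothetical extraction that fails to find an amount.
    return {**state, "amount": None}


def validate(state):
    if state["amount"] is None:
        raise ValueError("amount missing")
    return state


def route(state):
    return {**state, "queue": "payments"}


def execute(state):
    return {**state, "done": True}


steps = [("extract", extract), ("validate", validate),
         ("route", route), ("execute", execute)]
records = run_workflow(steps, {"text": "pay the invoice"})
failed = next((r for r in records if not r.ok), None)
print(failed.name, failed.state_in, failed.error)
```

The failing record carries both the step name and the state it was handed, which is the information that is hard to reconstruct from flat logs.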
Common questions
Do we need LangSmith if we're not using LangChain?
It's most natural with LangChain but works fine without. The SDK lets us instrument any LLM call — Vercel AI SDK, direct Anthropic SDK, custom orchestration — by wrapping the model call. Same traces, same evaluation tooling. We use LangSmith on plenty of non-LangChain projects.
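The wrapping pattern looks roughly like the sketch below. The LangSmith Python SDK provides a `traceable` decorator for this purpose; to keep the example runnable without an account or API key, the code uses a home-made `traced` decorator and an in-memory `TRACES` list as stand-ins. Both names, and the placeholder `call_model` function, are assumptions for illustration only.

```python
import functools
import time

TRACES = []  # in-memory stand-in for a tracing backend


def traced(name):
    """Decorator that records inputs, output or error, and latency for a call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"name": name, "inputs": {"args": args, "kwargs": kwargs}}
            start = time.perf_counter()
            try:
                record["output"] = fn(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call leaves a trace.
                record["latency_s"] = time.perf_counter() - start
                TRACES.append(record)
        return wrapper
    return decorator


@traced("summarise")
def call_model(prompt):
    # Placeholder for a real provider call (direct Anthropic SDK, custom orchestration, ...).
    return prompt.upper()


result = call_model("hello")
print(result, TRACES[0]["name"])
```

Because the wrapper sits around the function rather than inside a framework, the same approach applies whichever client library makes the underlying request.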
LangSmith vs Langfuse vs Helicone vs writing our own?
Different shapes. LangSmith has the deepest evaluation tooling (datasets, LLM-as-judge, dashboards), which makes it the best fit for teams that take pre-deploy testing seriously. Langfuse is open-source and self-hostable, useful when data sovereignty matters. Helicone offers simpler observability plus cost tracking. Writing your own is fine for basic logging, but you'll end up rebuilding LangSmith over time. Default to LangSmith unless residency forces Langfuse.
Does LangSmith handle data sovereignty?
LangSmith Cloud is hosted outside Australia (a US region, plus an EU region). For Australian data-residency obligations we either route only sanitised traces (PII redacted upstream) or use Langfuse self-hosted on Australian infrastructure. Reviewed per client.
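A minimal sketch of upstream redaction is shown below, assuming traces are plain dictionaries and that emails and Australian mobile numbers are the PII of interest. The two regular expressions and the `sanitise_trace` helper are illustrative assumptions; pattern matching alone will miss names, addresses, and other free-text identifiers, so a real deployment needs a broader approach.

```python
import re

# Illustrative patterns only; they are not an exhaustive PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
AU_MOBILE = re.compile(r"(?:\+61\s?4|\b04)\d{2}\s?\d{3}\s?\d{3}\b")


def redact(text):
    """Replace matched emails and Australian mobile numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return AU_MOBILE.sub("[PHONE]", text)


def sanitise_trace(trace):
    """Return a copy of a flat trace dict with every string field redacted."""
    return {key: redact(value) if isinstance(value, str) else value
            for key, value in trace.items()}


raw = {
    "input": "Email jane@example.com or call 0412 345 678",
    "output": "Reach me on +61 412 345 678",
    "latency_ms": 120,
}
clean = sanitise_trace(raw)
print(clean)
```

The point of doing this before the trace leaves our infrastructure is that the unredacted text never reaches the hosted service.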
What evaluation tests do you typically run?
Three buckets. Accuracy against labelled test sets (RAG answer correctness, classification accuracy). LLM-as-judge for subjective qualities (tone, helpfulness, hallucination detection). Performance regression (latency p95, cost per call). Every prompt or model change runs through this suite before deploy.
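Two of these buckets reduce to simple arithmetic, sketched here with the standard library. The function names, the nearest-rank definition of p95, and the sample numbers are our assumptions for illustration; the LLM-as-judge bucket needs a model call and is not shown.

```python
import math
from statistics import fmean


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labelled answer."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarise_run(predictions, labels, latencies_ms, costs_usd):
    """Bundle the accuracy and performance numbers for one evaluation run."""
    return {
        "accuracy": accuracy(predictions, labels),
        "latency_p95_ms": p95(latencies_ms),
        "cost_per_call_usd": fmean(costs_usd),
    }


# Made-up numbers for a four-example run.
report = summarise_run(
    predictions=["yes", "no", "yes", "yes"],
    labels=["yes", "no", "no", "yes"],
    latencies_ms=[120, 150, 110, 900],
    costs_usd=[0.002, 0.004, 0.003, 0.003],
)
print(report)
```

With only four samples the p95 is simply the slowest call, which is a reminder that latency percentiles need a reasonably large test set to be meaningful.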
How does LangSmith integrate with CI/CD?
Two paths. One, run evaluation suites as part of the test pipeline and fail the build if accuracy drops by more than X%. Two, run a daily evaluation against production traffic samples to detect drift. We've shipped both patterns across client projects; daily eval is the more common starting point.
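The first path can be as small as the gate below. It is a sketch under stated assumptions: accuracy is a fraction between 0 and 1, the threshold is read as percentage points, and the function names are hypothetical. The threshold stays a parameter because the right value of X depends on the project; the 2.0 used here is only a placeholder.

```python
def accuracy_gate(baseline, candidate, max_drop_pct):
    """True when accuracy has not dropped by more than max_drop_pct percentage points."""
    drop_pct = (baseline - candidate) * 100
    return drop_pct <= max_drop_pct


def ci_exit_code(baseline, candidate, max_drop_pct):
    """Map the gate to a process exit code: 0 passes the build, 1 fails it."""
    return 0 if accuracy_gate(baseline, candidate, max_drop_pct) else 1


# Placeholder numbers: baseline from the last released prompt, candidate from this build.
small_drop = ci_exit_code(baseline=0.92, candidate=0.91, max_drop_pct=2.0)
large_drop = ci_exit_code(baseline=0.92, candidate=0.85, max_drop_pct=2.0)
improvement = ci_exit_code(baseline=0.92, candidate=0.95, max_drop_pct=2.0)
print(small_drop, large_drop, improvement)
```

In a pipeline the script would end with `sys.exit(ci_exit_code(...))` so that a non-zero code fails the build. The same comparison, run on a schedule against sampled production traffic, gives the daily drift check.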