23 June 2026Updated 23 June 20269 min read

Structured Outputs and Function Calling: Making LLMs Reliable

Structured outputs and function calling are the mechanisms that make LLMs viable in production workflows — but reliable implementation requires deliberate schema design, validation layers, and observability from the start. This guide covers how to use both patterns effectively, when to choose each, and the failure modes that catch teams off-guard at scale.

Large language models are remarkably capable at reasoning and generating language — but in production workflows, free-form text output is a liability. When your system needs to route a customer request, populate a database record, or trigger a downstream API call, you cannot afford to parse natural language and hope for the best. Structured outputs and function calling are the mechanisms that bridge the gap between a model that thinks and a system that acts.

This post is a decision guide for technical leaders building LLM-powered applications. It covers what these mechanisms are, when to use each, and how to design for reliability rather than optimism.

What Are Structured Outputs in LLMs?

Structured outputs are a mode of LLM operation where the model is constrained to return data in a predefined schema — most commonly JSON — rather than arbitrary text. Instead of asking a model to "summarise the customer complaint", you ask it to return a JSON object with fields like category, severity, and summary. Modern inference APIs, including those from OpenAI, Anthropic, and Google, support schema-constrained generation natively.

A software engineer viewed from the side, face lit by monitor glow showing JSON schema code, working at a dimly lit desk in an Australian engineering office at night.

The core benefit is determinism of shape. You still cannot fully control what the model says inside each field, but you can guarantee the structure your downstream code receives. This is the difference between a system that sometimes works and one you can build automated workflows on top of.

JSON Schema as a Contract

JSON Schema is the standard used to define these constraints. It is a vocabulary that describes the shape, types, and required fields of a JSON document. When you pass a JSON Schema definition to a model via the API, the inference engine uses constrained decoding — guiding the token-generation process so that only outputs conforming to the schema are produced.

This means you define your output contract once, enforce it at inference time, and remove an entire class of parsing failures from your application. For teams building on top of LLMs, this is one of the most practical reliability improvements available today.

Key schema design principles:

Keep schemas as flat as reasonable. Deeply nested schemas increase the cognitive load on the model and the chance of subtle errors in populated values.
Use enum types for categorical fields. If a field can only be high, medium, or low, declare it as an enum — the model cannot hallucinate a fourth option.
Mark fields as required explicitly. Do not rely on the model to infer which fields matter.
Use descriptive field names. The model reads your schema — customer_sentiment_score is far more instructive than score1.

What Is Function Calling (Tool Calling)?

Function calling — now more commonly called tool calling — is a mechanism that lets the model signal its intent to invoke an external capability, rather than answering directly. You define a set of tools the model can choose from (each described with a name, description, and parameter schema), and the model returns a structured call object specifying which tool to invoke and with what arguments. Your application then executes the tool and, optionally, returns the result to the model for further reasoning.

Overhead view of an engineering desk with a hand-drawn tool-calling loop diagram in a notebook, surrounded by a laptop, coffee cup, and sticky notes, bathed in warm golden-hour window light.

This is the foundation of agentic patterns. Without reliable tool calling, an agent cannot interact with your systems in a predictable way.

How the Tool Calling Loop Works

The typical pattern is a multi-turn loop:

Your application sends a user message and a list of available tools to the model.
The model responds with a tool call — a structured object identifying the tool name and arguments.
Your application executes the tool (an API call, a database query, a calculation).
The result is returned to the model as a tool response message.
The model uses the result to generate a final answer or make another tool call.

The model never directly executes anything. It only signals intent. Your application code is always in the execution seat — which means you retain control, and you can add validation, logging, and guardrails at step 3 before anything actually runs.

Structured Outputs vs Function Calling: When to Use Each

These two mechanisms are complementary, not competing. Understanding when to reach for each one is the key design decision.

Scenario	Best Mechanism	Why
Extract structured data from unstructured text	Structured outputs	You want a parsed record, not a tool execution
Classify or label content into fixed categories	Structured outputs with enums	Schema enforces category bounds
Trigger an external API or side effect	Function/tool calling	Model signals intent; app executes
Multi-step reasoning with real-world data retrieval	Tool calling loop	Model can request data, reason, act iteratively
Generate a form-fillable document or report	Structured outputs	Output is data to be consumed, not an action
Orchestrate an agent across multiple systems	Tool calling	Enables sequential, conditional action chains

A common mistake is overreaching with tool calling when structured outputs would suffice. If you only need to extract information from text, constrained JSON generation is simpler, faster, and less error-prone than building an agent loop around it.

Designing for Reliability in Production

Getting structured outputs and tool calling to work in a demo is straightforward. Getting them to work reliably at production scale, across a distribution of real-world inputs, is a different problem entirely.

Prompt Design Matters as Much as Schema Design

Constrained decoding enforces the shape of the output, but the quality of the values inside that shape depends entirely on your prompt. A poorly written system prompt will produce structurally valid but semantically wrong outputs — the schema will be satisfied, and your downstream logic will still break.

Write your system prompt as if you are writing a specification for a junior analyst. Be explicit about edge cases. If a field should be null when information is absent, say so. If severity means business impact rather than technical severity, define it.

Validate Beyond the Schema

Schema conformance is a floor, not a ceiling. After the model returns a structured response, apply your own validation layer:

Range checks: Is a confidence score between 0 and 1? Is a date in the future when it should be?
Cross-field consistency: If action_required is true, is an action_description populated?
Downstream constraints: Does the returned account_id actually exist in your database?

This validation layer is the difference between a prototype and a production system. Model outputs, even constrained ones, are probabilistic. Your validation logic is deterministic. Layer them.

Design Tool Descriptions as Documentation

The model decides which tool to call based on the names and descriptions you provide — not on any deep understanding of your system. Treat tool descriptions like API documentation: accurate, specific, and unambiguous. If two tools have overlapping descriptions, the model will make inconsistent choices. If a description is vague, the model will fill the gap with inference that may not match your intent.

Handle Tool Failures Gracefully

When a tool call fails — a timeout, a validation error, a downstream API rate limit — your application needs a clear strategy. Common patterns include:

Returning the error message to the model with context, allowing it to retry with corrected arguments.
Falling back to a simpler, non-agentic path for the same request.
Surfacing a degraded but honest response to the end user rather than silently failing.

Silent failures in agentic loops are particularly dangerous because the model may continue reasoning on a false premise if it never receives an error signal.

Logging and Observability

Every tool call and every structured output your application generates should be logged with the full input context. This is not optional in production — it is the primary mechanism for debugging unexpected behaviour, identifying prompt regressions, and monitoring for model drift after an API or model version update. Standard observability tooling does not automatically capture LLM interaction payloads; you need to instrument this deliberately.

Common Failure Modes to Plan For

Production LLM applications fail in specific, recurring ways. Knowing them in advance is more useful than discovering them through incidents.

Schema-conformant but semantically wrong: The model returns a valid JSON object where all fields are populated with plausible-sounding but incorrect values. This is especially common on rare or ambiguous inputs. Mitigation: add domain-specific validation and, where stakes are high, a human review queue.

Tool selection errors: The model chooses the wrong tool, or calls a tool unnecessarily. This typically signals overlapping tool descriptions or an unclear system prompt. Mitigation: audit tool descriptions for specificity; reduce the tool set to the minimum necessary.

Argument hallucination: The model populates tool arguments with values that look valid but are not — an account ID that does not exist, a date that is out of range. Mitigation: validate all tool arguments before execution, not after.

Context window exhaustion in long loops: Multi-turn tool calling loops accumulate message history. On long tasks, you can hit context limits, causing truncation or degraded output. Mitigation: design loop termination conditions explicitly; summarise intermediate results when the context grows large.

What This Means for Your Architecture

Structured outputs and tool calling are not just API features — they are architectural decisions that shape how you separate concerns in an LLM-powered system.

The model is responsible for reasoning and intent. Your application is responsible for execution, validation, and state. When you blur this boundary — letting the model do too much, or trusting its outputs without verification — you introduce fragility that is hard to diagnose and harder to fix in production.

For teams building AI-powered platforms and internal tools, this separation is the foundation of every reliable integration we build. The model is a reasoning engine; your application is the system of record.

If you are at the stage of designing your AI product strategy — deciding which workflows to automate, what the human-in-the-loop checkpoints are, and how to sequence capability builds — these are the right questions to be asking before writing a line of integration code.

Getting These Patterns Right From the Start

Structured outputs and function calling unlock genuinely powerful automation — but they require deliberate design to be reliable. Schema design, prompt engineering, validation layers, observability, and failure handling are not afterthoughts. They are the work.

For more on building AI systems that hold up in production, explore our insights on AI engineering, MLOps, and the practical foundations of LLM application development.

If you are building an LLM-powered workflow and want to pressure-test your architecture before it reaches production, we are happy to take a look. Tell us what you are building and where the rough edges are — that is usually the most useful place to start.

LLM Application Development AI engineering structured outputs function calling Production AI

Chris Kerr

Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.

23 June 2026

Fine-Tuning Small Language Models for Domain-Specific Tasks

Fine-tuning a small language model can outperform a frontier model on narrow tasks — but only when the task, data, and economics actually justify the overhead. This article covers when fine-tuning makes sense, how to prepare data and evaluate properly, and how to honestly assess the cost trade-off against prompting a frontier model API.

10 min readChris Kerr

23 June 2026

Caching Strategies for LLM Applications: Reducing Latency and Cost

Caching is one of the most underused levers for reducing cost and latency in production LLM applications. This article covers prompt caching, semantic caching, and response caching — what each layer does, when to use it, and how to think about invalidation and observability.

11 min readChris Kerr

22 June 2026

Agentic RAG: When Retrieval Needs Reasoning

Standard RAG works well when one retrieval pass is enough. Agentic RAG is the architecture for problems that require planning, iterative retrieval, and reasoning over results from multiple sources. This post covers the patterns, the platform options, and the real engineering trade-offs.

7 min readChris Kerr

Structured Outputs and Function Calling: Making LLMs Reliable

What Are Structured Outputs in LLMs?

JSON Schema as a Contract

What Is Function Calling (Tool Calling)?

How the Tool Calling Loop Works

Structured Outputs vs Function Calling: When to Use Each

Designing for Reliability in Production

Prompt Design Matters as Much as Schema Design

Validate Beyond the Schema

Design Tool Descriptions as Documentation

Handle Tool Failures Gracefully

Logging and Observability

Common Failure Modes to Plan For

What This Means for Your Architecture

Getting These Patterns Right From the Start

Related posts

Fine-Tuning Small Language Models for Domain-Specific Tasks

Caching Strategies for LLM Applications: Reducing Latency and Cost

Agentic RAG: When Retrieval Needs Reasoning

Related posts

Fine-Tuning Small Language Models for Domain-Specific Tasks

Caching Strategies for LLM Applications: Reducing Latency and Cost

Agentic RAG: When Retrieval Needs Reasoning