6 June 2026 | Updated 22 July 2026 | 9 min read

AI Inference Optimisation: A Production Decision Guide

Shipping a model is the beginning, not the end. Once an AI feature is live, inference cost and latency become real engineering problems that compound at scale. This guide explains the key optimisation levers available to technical leaders — and how to choose between them.

Shipping a model is the beginning, not the end. Once an AI feature is live, inference cost and latency become real engineering problems — GPU bills compound, response times drift, and product teams start asking why the AI endpoint is the slowest thing in the stack.

ONNX Runtime, combined with quantisation and graph optimisation, is one of the most practical levers available to teams operating AI in production. It does not require retraining your model. It does not require a new architecture. It works on models you have already built.

This guide explains how these techniques work, when to apply them, and what trade-offs to expect — framed as decisions for technical leaders, not implementation tutorials.

What Is ONNX Runtime?

ONNX Runtime is an open-source inference engine developed by Microsoft that executes models defined in the Open Neural Network Exchange (ONNX) format. ONNX is a vendor-neutral, open standard for representing machine learning models — a portable intermediate representation that decouples a model from the framework it was trained in.

Once a model is exported to ONNX format, ONNX Runtime can execute it across a wide range of hardware targets — CPU, GPU, and specialised accelerators — using a system of pluggable execution providers. PyTorch, TensorFlow, and scikit-learn models can all be converted and deployed through the same runtime.

The practical consequence is that your training framework becomes a design-time concern, not a production constraint. You train in one framework, export to ONNX, and deploy with a runtime optimised for your target hardware. This separation of concerns matters when you are managing multiple models across different teams or hardware environments.
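As a minimal sketch of what that separation can look like in code, assuming the onnxruntime Python package is installed and a model has already been exported to a file: the helper below picks execution providers in a preference order and falls back to CPU. The preference order and the function names are illustrative assumptions, not recommendations.

```python
from typing import List, Sequence

# Illustrative preference order (an assumption): try GPU first, then CPU.
PREFERRED_PROVIDERS = ("CUDAExecutionProvider", "CPUExecutionProvider")


def choose_providers(available: Sequence[str],
                     preferred: Sequence[str] = PREFERRED_PROVIDERS) -> List[str]:
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in preferred if p in available]
    # Fall back to the CPU provider if nothing preferred is present.
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path: str):
    """Create an inference session for an exported ONNX model."""
    # Imported lazily so choose_providers works without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

The same calling code then runs unchanged on a GPU host or a CPU-only host; only the list of available providers differs.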

Why Inference Performance Is a Strategic Concern

Inference performance is a product and cost concern, not just an infrastructure one. Latency affects user experience directly in synchronous AI features — search ranking, document processing, recommendation, fraud detection. Cost scales with every request, and at production volumes, the gap between an optimised and unoptimised inference path becomes commercially significant.

Most Australian engineering teams underinvest in inference optimisation because the gains are less visible than model accuracy improvements. The model works — it just works slowly, or expensively. That tension sharpens when the product scales or when a cost review hits the AI infrastructure line.

The three primary levers available to technical leaders — without retraining the model — are:

Runtime selection — running the model through an optimised execution engine rather than the training framework it was built in
Graph optimisation — restructuring the computational graph to remove redundant operations and improve execution efficiency
Quantisation — reducing numerical precision to decrease memory and compute requirements

Each lever involves different trade-offs in terms of implementation effort, risk, and payoff. Understanding those trade-offs is the decision that matters.

Graph Optimisation: Low Risk, High Default Value

Graph optimisation is the process of transforming a model's computational graph to make it more efficient at execution time. ONNX Runtime applies these transformations automatically when a model is loaded. No retraining is involved, and outputs should match the unoptimised model within numerical tolerance.

The optimisations occur at three levels:

Basic optimisations handle straightforward clean-up: pre-computing static parts of the graph, eliminating redundant nodes, and resolving shape information. These are low-risk and almost always beneficial with no meaningful downside.

Extended optimisations apply more aggressive transformations such as operator fusion — merging consecutive operations into single, more efficient kernels. This reduces memory bandwidth pressure and the overhead associated with executing many small operations in sequence. The gains are most pronounced in transformer-based architectures.

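Layout optimisations reorder data in memory to better match the target hardware's preferred access patterns. In ONNX Runtime these run after the graph has been partitioned across execution providers, and the documented layout optimisation (the NCHWc transform) applies to nodes assigned to the CPU execution provider. The benefit depends on the model and the deployment target, so it should be measured rather than assumed.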
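The levels above correspond to a session option in ONNX Runtime's Python API. The sketch below assumes the onnxruntime package is installed. The short labels ("basic", "extended", "all") and the helper names are hypothetical conveniences for this example. The enum members, `SessionOptions`, and `optimized_model_filepath` are part of the library.

```python
# Maps the short labels used in this sketch to attribute names on
# onnxruntime.GraphOptimizationLevel.
LEVEL_ATTRS = {
    "none": "ORT_DISABLE_ALL",
    "basic": "ORT_ENABLE_BASIC",
    "extended": "ORT_ENABLE_EXTENDED",
    "all": "ORT_ENABLE_ALL",  # basic + extended + layout
}


def level_attr(level: str) -> str:
    """Return the GraphOptimizationLevel attribute name for a label."""
    try:
        return LEVEL_ATTRS[level]
    except KeyError:
        raise ValueError(f"unknown optimisation level: {level!r}") from None


def make_session(model_path: str, level: str = "all", save_to: str = ""):
    """Load a model with an explicit graph optimisation level."""
    import onnxruntime as ort  # imported lazily; requires onnxruntime

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(
        ort.GraphOptimizationLevel, level_attr(level))
    if save_to:
        # Persist the optimised graph so it can be inspected or reused.
        options.optimized_model_filepath = save_to
    return ort.InferenceSession(model_path, sess_options=options,
                                providers=["CPUExecutionProvider"])
```

Loading the same model at "none" and at "all" gives a simple before-and-after pair for latency and output comparison.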

The important leadership point here is that graph optimisation is deterministic. Outputs before and after optimisation can be compared and validated against acceptable numerical tolerance. This makes it a low-risk default step in any model deployment pipeline — the kind of thing that should be standard practice rather than an optional performance project.
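A validation step of that kind can be very small. The sketch below compares two flat lists of output values, one from the unoptimised model and one from the optimised model, against a tolerance. The function names and the default tolerances are assumptions for illustration; suitable tolerances depend on the model and the task.

```python
import math
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest absolute difference between two equally sized output vectors."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rel_tol: float = 1e-4, abs_tol: float = 1e-5) -> bool:
    """True if every value agrees within the relative or absolute tolerance."""
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))
```

Running a check like this over a fixed set of representative inputs on every deployment turns "the outputs should still match" into a pipeline gate.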

Quantisation: The Bigger Lever, With Trade-offs to Manage

Quantisation is the technique of reducing the numerical precision of a model's weights and activations — most commonly from 32-bit floating point to 16-bit or 8-bit representations. Lower precision means smaller model size, reduced memory bandwidth, and faster arithmetic on hardware that supports it.
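To make the arithmetic concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantisation of a handful of weights. It is illustrative only and is not how ONNX Runtime implements quantisation internally. The function names and the sample values are assumptions.

```python
from typing import List, Sequence


def symmetric_scale(values: Sequence[float], qmax: int = 127) -> float:
    """Scale that maps the largest magnitude in `values` onto qmax."""
    largest = max(abs(v) for v in values)
    return largest / qmax if largest > 0 else 1.0


def quantise(values: Sequence[float], scale: float,
             qmin: int = -127, qmax: int = 127) -> List[int]:
    """Round each value to the nearest integer step and clamp to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantise(q_values: Sequence[int], scale: float) -> List[float]:
    """Map integer steps back to approximate floating-point values."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.0, 1.0]            # hypothetical float32 weights
scale = symmetric_scale(weights)            # one step is about 0.01
restored = dequantise(quantise(weights, scale), scale)
# Each restored value differs from the original by at most half a step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per value is bounded by half the scale, which is why a tensor with a few large outliers (and therefore a large scale) tends to lose more precision than one with a tight value range.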

The trade-off is accuracy. Quantisation introduces small numerical errors that, depending on the model and the task, may be negligible or may require careful validation. This is not a reason to avoid quantisation — it is a reason to measure carefully before deploying.

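Two primary quantisation approaches are relevant to ONNX Runtime deployments, each suited to different operational contexts. Post-training quantisation is handled by ONNX Runtime's own quantisation tooling, while quantisation-aware training happens in the training framework before the model is exported.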

Post-Training Quantisation (PTQ)

Post-training quantisation is applied after training is complete, without access to training infrastructure or the full training dataset. It is the fastest path to a quantised model and the right starting point for most teams.

Dynamic quantisation converts weights to a lower-precision integer format ahead of time and computes the quantisation parameters for activations on the fly during inference. It is particularly effective for transformer-based models and natural language processing tasks, and it requires no calibration data. The implementation effort is low, making it a practical first experiment for most production AI teams.
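In ONNX Runtime's Python tooling, dynamic quantisation is a single function call. The sketch below assumes the onnxruntime package is installed. The output naming convention and the helper names are hypothetical choices for this example.

```python
from pathlib import Path


def int8_output_path(model_path: str) -> str:
    """Derive an output file name, e.g. model.onnx -> model.int8.onnx."""
    path = Path(model_path)
    return str(path.with_name(f"{path.stem}.int8{path.suffix}"))


def quantise_dynamic(model_path: str) -> str:
    """Write a dynamically quantised copy of the model and return its path."""
    # Imported lazily; requires the onnxruntime package.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output_path = int8_output_path(model_path)
    # Weights are stored as signed 8-bit integers; no calibration data is needed.
    quantize_dynamic(model_path, output_path, weight_type=QuantType.QInt8)
    return output_path
```

The quantised file loads like any other ONNX model, so the original and quantised versions can be compared side by side on the same evaluation set before anything changes in production.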

Static quantisation compresses both weights and activations, using a small calibration dataset to determine the ranges of values that activations take during inference. It typically delivers better throughput than dynamic quantisation, but it requires a representative calibration set and more validation work. The appropriate choice depends on whether your team has the operational maturity to manage that validation process reliably.
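What calibration does can be sketched in a few lines: observe the range of activation values over representative inputs, then derive fixed quantisation parameters from that range. The pure-Python example below shows simple min-max calibration for an unsigned 8-bit format. It is a conceptual sketch under that assumption, not ONNX Runtime's implementation, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def observe_range(batches: Iterable[Sequence[float]]) -> Tuple[float, float]:
    """Track the min and max activation seen across calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    if lo > hi:
        raise ValueError("no calibration data supplied")
    return lo, hi


def uint8_params(lo: float, hi: float) -> Tuple[float, int]:
    """Derive an asymmetric uint8 scale and zero point from an observed range."""
    # Widen the range to include zero so that 0.0 is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)
    return scale, max(0, min(255, zero_point))
```

If the calibration set misses the range that production traffic actually produces, values outside the observed range are clipped at inference time, which is why a representative calibration set matters more than a large one.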

Quantisation-Aware Training (QAT)

Quantisation-aware training simulates the effects of reduced precision during the training process, allowing the model to adapt its weights to minimise accuracy loss. It typically preserves accuracy better than post-training approaches, but it requires access to training infrastructure, datasets, and engineering time.

For most teams evaluating inference optimisation, QAT is not the right starting point. It makes sense when post-training quantisation produces accuracy degradation that is unacceptable for the use case, and when the team has the ML engineering capacity to run additional training cycles.

Choosing the Right Approach for Your Context

The decision between these techniques is not purely technical — it depends on your team's operational maturity, the accuracy requirements of the specific model, and where you are in the AI product lifecycle.

A useful decision frame:

Context	Recommended starting point
Model in production, accuracy is non-critical, team is small	Graph optimisation + dynamic quantisation
Model in production, accuracy is business-critical	Graph optimisation + static quantisation with careful validation
Model still in training, accuracy degradation is a known risk	Consider quantisation-aware training
Multiple models across different hardware targets	ONNX export as a standard step in your deployment pipeline

The table above is a framework, not a formula. Every model and production environment has specific characteristics that affect which approach delivers the best outcome. The right move is to measure before and after, validate accuracy against your acceptance criteria, and treat inference optimisation as an ongoing concern rather than a one-time project.
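One way to make "validate against your acceptance criteria" concrete is to encode the criteria as a check that runs on every candidate model. The sketch below is an assumption about how such a gate might look. The field names, the metrics chosen, and the default thresholds are hypothetical and should be replaced with targets agreed for the specific model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    accuracy: float        # task metric on a held-out evaluation set, 0..1
    p95_latency_ms: float  # 95th percentile latency under representative load


def accept(baseline: Measurement, candidate: Measurement,
           max_accuracy_drop: float = 0.005,
           min_latency_gain: float = 0.10) -> bool:
    """Accept the optimised model only if it is faster and still accurate enough."""
    accuracy_drop = baseline.accuracy - candidate.accuracy
    latency_gain = 1.0 - candidate.p95_latency_ms / baseline.p95_latency_ms
    return accuracy_drop <= max_accuracy_drop and latency_gain >= min_latency_gain
```

Writing the thresholds down before running the experiment keeps the decision from being made after the fact, when a large latency gain can make a small accuracy loss look more acceptable than it is.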

What Australian Teams Often Miss

Inference optimisation is frequently treated as a post-launch performance task rather than a first-class engineering concern. That sequencing is understandable — getting a model to production is hard enough — but it creates compounding costs that are difficult to unwind later.

The teams that manage inference costs well tend to do a few things differently:

They establish inference latency and cost targets before deployment, not after
They instrument inference endpoints with the same rigour as application services
They treat model drift and performance degradation as operationally equivalent problems
They include inference optimisation in their AI engineering review cycle, not just their model development cycle

This connects directly to the broader challenge of AI operations — the discipline of keeping AI systems performing reliably and economically in production over time. Inference optimisation is one part of that discipline. MLOps practices, monitoring, and deployment tooling are others.

The Decision That Matters Most

The most important inference optimisation decision is not which technique to apply — it is whether your team has the operational foundation to apply any technique safely and measure the result.

Running ONNX Runtime with graph optimisation enabled is low-risk and broadly applicable. Quantisation requires more careful validation. Quantisation-aware training requires ML engineering depth that many growing Australian teams do not yet have in-house.

If your team is evaluating these options without a clear view of your current inference performance baseline, the right first step is measurement, not optimisation. You cannot improve what you have not defined.
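A baseline does not need special tooling to get started. The sketch below times repeated calls to any inference function and reports latency percentiles using only the Python standard library. The function name, the warm-up count, and the number of runs are illustrative assumptions, and a real baseline should use representative inputs and realistic concurrency.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(call: Callable[[], object], warmup: int = 10,
                    runs: int = 200) -> Dict[str, float]:
    """Time repeated calls and report latency percentiles in milliseconds."""
    for _ in range(warmup):
        call()  # warm caches and lazy initialisation before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99; "inclusive" keeps
    # every reported percentile within the observed min and max.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98],
            "mean_ms": statistics.fmean(samples)}
```

Passing a zero-argument callable that wraps a single inference request, for example a lambda around a session's run call, produces the numbers that later optimisation work is judged against.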

For teams building or scaling AI-powered platforms, inference optimisation is one component of a broader data infrastructure and operations capability. Getting it right early avoids the more expensive work of re-engineering production systems under cost pressure.

If you are working through these decisions and want a technical perspective on your specific stack and use case, get in touch — we are happy to have a direct conversation about what the options look like for your environment.

For more on taking AI from prototype to production, see our insights or read more about our AI engineering and AI product strategy capabilities.

Chris Kerr

Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.
