AI Inference Optimisation: A Production Decision Guide
Shipping a model is the beginning, not the end. Once an AI feature is live, inference cost and latency become real engineering problems — GPU bills compound, response times drift, and product teams start asking why the AI endpoint is the slowest thing in the stack.
ONNX Runtime, combined with quantisation and graph optimisation, is one of the most practical levers available to teams operating AI in production. It does not require retraining your model. It does not require a new architecture. It works on models you have already built.
This guide explains how these techniques work, when to apply them, and what trade-offs to expect — framed as decisions for technical leaders, not implementation tutorials.
What Is ONNX Runtime?
ONNX Runtime is an open-source inference engine developed by Microsoft that executes models defined in the Open Neural Network Exchange (ONNX) format. ONNX is a vendor-neutral, open standard for representing machine learning models — a portable intermediate representation that decouples a model from the framework it was trained in.

Once a model is exported to ONNX format, ONNX Runtime can execute it across a wide range of hardware targets — CPU, GPU, and specialised accelerators — using a system of pluggable execution providers. PyTorch, TensorFlow, and scikit-learn models can all be converted and deployed through the same runtime.
The practical consequence is that your training framework becomes a design-time concern, not a production constraint. You train in one framework, export to ONNX, and deploy with a runtime optimised for your target hardware. This separation of concerns matters when you are managing multiple models across different teams or hardware environments.
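As a rough illustration of how execution providers show up in code, the sketch below opens a session on the first preferred provider that the installed `onnxruntime` build reports as available. The helper names `choose_providers` and `open_session`, the preference order, and the model path you would pass in are assumptions made for this example. Only the provider name strings and the `onnxruntime` calls come from the library itself.

```python
from typing import List, Sequence

# Preference order is an assumption for illustration: GPU first, CPU as fallback.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def choose_providers(available: Sequence[str],
                     preferred: Sequence[str] = PREFERRED) -> List[str]:
    """Return the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} are available")
    return chosen


def open_session(model_path: str):
    """Open an inference session on the best available provider."""
    # Imported here so choose_providers() works without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

The same exported file can then be served from a CPU-only container or a GPU node without code changes. Only the list of available providers differs.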
Why Inference Performance Is a Strategic Concern
Inference performance is a product and cost concern, not just an infrastructure one. Latency affects user experience directly in synchronous AI features — search ranking, document processing, recommendation, fraud detection. Cost scales with every request, and at production volumes, the gap between an optimised and unoptimised inference path becomes commercially significant.
Many Australian engineering teams underinvest in inference optimisation because its gains are less visible than model accuracy improvements. The model works; it just works slowly, or expensively. That tension sharpens as the product scales or when a cost review reaches the AI infrastructure line.
The three primary levers available to technical leaders — without retraining the model — are:
- Runtime selection — running the model through an optimised execution engine rather than the training framework it was built in
- Graph optimisation — restructuring the computational graph to remove redundant operations and improve execution efficiency
- Quantisation — reducing numerical precision to decrease memory and compute requirements
Each lever involves different trade-offs in terms of implementation effort, risk, and payoff. Understanding those trade-offs is the decision that matters.
Graph Optimisation: Low Risk, High Default Value
Graph optimisation is the process of transforming a model's computational graph to make it more efficient at execution time. ONNX Runtime applies these transformations automatically when a model is loaded. The model is not retrained, and outputs should match the unoptimised model to within small floating-point differences.

The optimisations occur at three levels:
Basic optimisations handle straightforward clean-up: pre-computing static parts of the graph, eliminating redundant nodes, and resolving shape information. These are low-risk and almost always beneficial with no meaningful downside.
Extended optimisations apply more aggressive transformations such as operator fusion — merging consecutive operations into single, more efficient kernels. This reduces memory bandwidth pressure and the overhead associated with executing many small operations in sequence. The gains are most pronounced in transformer-based architectures.
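Layout optimisations change how tensor data is arranged in memory to better match the hardware's preferred access patterns. In ONNX Runtime these run after the graph has been partitioned across execution providers and, per the project's documentation, apply only to nodes assigned to the CPU execution provider, so expect them to matter for CPU deployments rather than GPU ones.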
The important leadership point here is that graph optimisation is deterministic. Outputs before and after optimisation can be compared and validated against acceptable numerical tolerance. This makes it a low-risk default step in any model deployment pipeline — the kind of thing that should be standard practice rather than an optional performance project.
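A minimal sketch of that before-and-after comparison might look like the following. It runs one input through the same model with graph optimisations disabled and fully enabled, then checks that the outputs agree. The function names, the tolerance values, and the choice of the CPU provider are assumptions for illustration, and the sketch assumes every model output is a numeric tensor. `GraphOptimizationLevel` also offers `ORT_ENABLE_BASIC` and `ORT_ENABLE_EXTENDED` if you want to test the levels one at a time.

```python
import numpy as np


def outputs_match(reference, candidate, rtol=1e-3, atol=1e-5) -> bool:
    """True if both runs produced the same number of outputs and every
    pair of output tensors agrees within the given tolerance."""
    return len(reference) == len(candidate) and all(
        np.allclose(r, c, rtol=rtol, atol=atol)
        for r, c in zip(reference, candidate)
    )


def validate_graph_optimisation(model_path: str, feed: dict) -> bool:
    """Run one input with optimisations off and fully on, then compare."""
    import onnxruntime as ort  # imported lazily; only this function needs it

    results = []
    for level in (ort.GraphOptimizationLevel.ORT_DISABLE_ALL,
                  ort.GraphOptimizationLevel.ORT_ENABLE_ALL):
        options = ort.SessionOptions()
        options.graph_optimization_level = level
        session = ort.InferenceSession(model_path, sess_options=options,
                                       providers=["CPUExecutionProvider"])
        # Passing None as the first argument returns all model outputs.
        results.append(session.run(None, feed))
    return outputs_match(results[0], results[1])
```

In practice you would run this over a representative set of inputs rather than one, and choose tolerances that reflect what the downstream consumer of the output can accept.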
Quantisation: The Bigger Lever, With Trade-offs to Manage
Quantisation is the technique of reducing the numerical precision of a model's weights and activations, most commonly from 32-bit floating point to 16-bit floating point or 8-bit integer representations. Lower precision means smaller model size, reduced memory bandwidth, and faster arithmetic on hardware that supports it.
The trade-off is accuracy. Quantisation introduces small numerical errors that, depending on the model and the task, may be negligible or may require careful validation. This is not a reason to avoid quantisation — it is a reason to measure carefully before deploying.
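To make the size-versus-error trade-off concrete, here is a small worked example of affine 8-bit quantisation using NumPy. It illustrates the general scheme (a scale and a zero point mapping floats onto the int8 range) and is not a reproduction of the exact arithmetic inside any particular ONNX Runtime kernel. The function names are invented for this sketch.

```python
import numpy as np


def quantise_int8(x: np.ndarray):
    """Affine-quantise float values to int8. Returns (q, scale, zero_point)."""
    # The representable range must include 0.0 so that zero maps to an exact integer.
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantise(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale


weights = np.linspace(-1.0, 3.0, 101, dtype=np.float32)
q, scale, zero_point = quantise_int8(weights)
max_error = float(np.max(np.abs(dequantise(q, scale, zero_point) - weights)))
```

Storage drops by a factor of four (`q.nbytes` against `weights.nbytes`), and the round-trip error for each value is bounded by roughly half of `scale`. Whether that error matters depends on how it accumulates through the layers of your specific model, which is why end-to-end validation is the real test.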
ONNX Runtime supports two primary quantisation approaches, each suited to different operational contexts:
Post-Training Quantisation (PTQ)
Post-training quantisation is applied after training is complete, without access to training infrastructure or the full training dataset. It is the fastest path to a quantised model and the right starting point for most teams.
Dynamic quantisation converts weights to 8-bit integers ahead of time, in an offline step, and computes the quantisation parameters for activations on the fly at runtime. It is particularly effective for transformer-based models and natural language processing tasks, and it requires no calibration data. The implementation effort is low, making it a practical first experiment for most production AI teams.
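A first experiment with dynamic quantisation can be as short as the sketch below, which uses `quantize_dynamic` from `onnxruntime.quantization`. The helper names and the `.int8.onnx` naming convention are assumptions for this example, and the model path is whatever your export step produced.

```python
from pathlib import Path


def int8_path_for(model_path: str) -> Path:
    """model.onnx -> model.int8.onnx (the naming convention is an assumption)."""
    return Path(model_path).with_suffix(".int8.onnx")


def quantise_dynamic(model_path: str) -> Path:
    """Write an int8-weight copy of the model next to the original."""
    # Imported lazily so the path helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output = int8_path_for(model_path)
    quantize_dynamic(model_path, str(output), weight_type=QuantType.QInt8)
    return output
```

Comparing the two file sizes gives a quick sanity check that quantisation actually took effect. Accuracy and latency still need to be measured on your own evaluation data and hardware.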
Static quantisation compresses both weights and activations, using a small calibration dataset to determine the ranges of values that activations take during inference. It typically delivers better throughput than dynamic quantisation, but it requires a representative calibration set and more validation work. The appropriate choice depends on whether your team has the operational maturity to manage that validation process reliably.
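The calibration step is where static quantisation differs in practice. The sketch below feeds batches of representative inputs to `quantize_static` through a `CalibrationDataReader`. It assumes a model with a single float32 input and calibration samples held in one NumPy array with the batch on the first axis. The function names, batch size, and choice of the QDQ format with int8 weights and activations are illustrative assumptions, not recommendations for every model.

```python
import numpy as np


def calibration_feeds(samples: np.ndarray, input_name: str, batch_size: int):
    """Yield input dictionaries covering the calibration samples in batches."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size].astype(np.float32)
        yield {input_name: batch}


def quantise_static(fp32_path: str, int8_path: str, samples: np.ndarray,
                    input_name: str, batch_size: int = 8) -> None:
    """Calibrate on `samples` and write a statically quantised model."""
    # Imported lazily so calibration_feeds() works without onnxruntime installed.
    from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                          QuantType, quantize_static)

    class ArrayReader(CalibrationDataReader):
        """Hands out one batch per call and None when the data is exhausted."""

        def __init__(self):
            self._feeds = calibration_feeds(samples, input_name, batch_size)

        def get_next(self):
            return next(self._feeds, None)

    quantize_static(fp32_path, int8_path, ArrayReader(),
                    quant_format=QuantFormat.QDQ,
                    activation_type=QuantType.QInt8,
                    weight_type=QuantType.QInt8)
```

The quality of the result depends heavily on how representative `samples` is of production traffic, which is the validation burden the paragraph above refers to.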
Quantisation-Aware Training (QAT)
Quantisation-aware training simulates the effects of reduced precision during the training process, allowing the model to adapt its weights to minimise accuracy loss. The training itself happens in the original framework; the resulting model is then exported to ONNX and converted for ONNX Runtime. It typically preserves accuracy better than post-training approaches, but it requires access to training infrastructure, datasets, and engineering time.
For most teams evaluating inference optimisation, QAT is not the right starting point. It makes sense when post-training quantisation produces accuracy degradation that is unacceptable for the use case, and when the team has the ML engineering capacity to run additional training cycles.
Choosing the Right Approach for Your Context
The decision between these techniques is not purely technical — it depends on your team's operational maturity, the accuracy requirements of the specific model, and where you are in the AI product lifecycle.
A useful decision frame:
| Context | Recommended starting point |
|---|---|
| Model in production, accuracy is non-critical, team is small | Graph optimisation + dynamic quantisation |
| Model in production, accuracy is business-critical | Graph optimisation + static quantisation with careful validation |
| Model still in training, accuracy degradation is a known risk | Consider quantisation-aware training |
| Multiple models across different hardware targets | ONNX export as a standard step in your deployment pipeline |
The table above is a framework, not a formula. Every model and production environment has specific characteristics that affect which approach delivers the best outcome. The right move is to measure before and after, validate accuracy against your acceptance criteria, and treat inference optimisation as an ongoing concern rather than a one-time project.
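One way to make "validate against your acceptance criteria" concrete is to write the criteria down as a gate that every optimised candidate must pass. The sketch below is a hypothetical example: the `Measurement` fields, the function name, and the default thresholds are placeholders that each team would set for its own model and product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    accuracy: float        # task metric on a held-out evaluation set, 0..1
    p95_latency_ms: float  # 95th percentile latency under representative load


def accept(baseline: Measurement, candidate: Measurement,
           max_accuracy_drop: float = 0.005,
           min_speedup: float = 1.2) -> bool:
    """Accept the candidate only if it is fast enough to justify the change
    and loses no more accuracy than the agreed budget."""
    accuracy_ok = baseline.accuracy - candidate.accuracy <= max_accuracy_drop
    speedup = baseline.p95_latency_ms / candidate.p95_latency_ms
    return accuracy_ok and speedup >= min_speedup
```

Encoding the thresholds this way means the accuracy budget is agreed with the product owner before the experiment runs, not negotiated after the numbers come in.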
What Australian Teams Often Miss
Inference optimisation is frequently treated as a post-launch performance task rather than a first-class engineering concern. That sequencing is understandable — getting a model to production is hard enough — but it creates compounding costs that are difficult to unwind later.
The teams that manage inference costs well tend to do a few things differently:
- They establish inference latency and cost targets before deployment, not after
- They instrument inference endpoints with the same rigour as application services
- They treat model drift and performance degradation as operationally equivalent problems
- They include inference optimisation in their AI engineering review cycle, not just their model development cycle
This connects directly to the broader challenge of AI operations — the discipline of keeping AI systems performing reliably and economically in production over time. Inference optimisation is one part of that discipline. MLOps practices, monitoring, and deployment tooling are others.
The Decision That Matters Most
The most important inference optimisation decision is not which technique to apply — it is whether your team has the operational foundation to apply any technique safely and measure the result.
Running ONNX Runtime with graph optimisation enabled is low-risk and broadly applicable. Quantisation requires more careful validation. Quantisation-aware training requires ML engineering depth that many growing Australian teams do not yet have in-house.
If your team is evaluating these options without a clear view of your current inference performance baseline, the right first step is measurement, not optimisation. You cannot improve what you have not measured.
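A baseline does not need heavy tooling to get started. The sketch below times any callable and reports latency percentiles using only the Python standard library. The `benchmark` function and its `run_once` parameter are hypothetical names: in practice `run_once` would wrap a single inference call on a representative input. The warm-up and iteration counts are arbitrary defaults.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object],
              warmup: int = 10, iterations: int = 100) -> Dict[str, float]:
    """Time repeated calls to run_once and summarise latency in milliseconds."""
    # Warm-up calls are not timed; they absorb one-off costs such as lazy
    # initialisation and cold caches.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Running this against the current production path first, on production-like hardware, gives the reference numbers that any later optimisation has to beat. Tail percentiles usually matter more than the mean for synchronous features.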
For teams building or scaling AI-powered platforms, inference optimisation is one component of a broader data infrastructure and operations capability. Getting it right early avoids the more expensive work of re-engineering production systems under cost pressure.
If you are working through these decisions and want a technical perspective on your specific stack and use case, get in touch — we are happy to have a direct conversation about what the options look like for your environment.
Chris Kerr
Founder of Horizon Labs. Twenty years building production software for Australian mid-market businesses, the last seven focused on putting AI into systems that operate at 3am without anyone watching. Writes about strategy, fractional CTO work, and the operational discipline that separates AI demos from AI products.


