6 June 2026Updated 22 July 202610 min read

Open-Source vs Proprietary LLMs: When to Self-Host

Self-hosting open-weight models like Llama or Mistral gives Australian teams data residency, cost control at scale, and deployment flexibility — but it comes with real infrastructure and operational tradeoffs. This post works through when each approach makes sense, with a practical framework for mid-market teams deciding between open-weight self-hosting and proprietary APIs like GPT-4 or Claude.

Open-Source vs Proprietary LLMs: When to Self-Host

The decision to self-host an open-weight model like Llama or Mistral — rather than call a proprietary API like GPT-4 or Claude — is one of the most consequential infrastructure choices an AI engineering team can make. Get it right and you get cost control, data residency, and deployment flexibility. Get it wrong and you get an underpowered model, a GPU bill you didn't budget for, and an ops burden your team wasn't staffed to carry.

This post works through the real tradeoffs. No benchmarks cherry-picked to flatter a preference. Just an honest account of when each path makes sense for Australian mid-market teams.

What Are Open-Weight Models?

An open-weight model is a large language model whose trained parameters are publicly released, allowing anyone to download, run, and modify the model on their own infrastructure. Llama (Meta) and Mistral are the most widely deployed examples. "Open-weight" is more precise than "open-source" because the training data and some licensing terms vary — but in practice, these models can be self-hosted without paying a per-token fee to a vendor.

View through an office doorway into a small infrastructure room where an engineer sketches a system diagram on a whiteboard, lit by warm afternoon window light against cooler corridor shadows.

Proprietary models — GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google) — are accessed exclusively via API. You send a request, the vendor's infrastructure runs inference, you receive a response. You never hold the weights.

The Core Tradeoff in One Sentence

Proprietary APIs give you the best available capability at the lowest entry cost; open-weight self-hosting gives you control, data residency, and lower marginal cost at scale — but requires you to own the infrastructure and accept a capability gap on the most demanding tasks.

Bright daylight-lit standing desk in an Australian open-plan office showing dual monitors with blurred terminal and code editor screens, a mechanical keyboard, notebook, and takeaway coffee cup.

When Self-Hosting Open-Weight Models Makes Sense

Your data cannot leave your environment

Data residency is the clearest argument for self-hosting. If your application processes personal information governed by the Privacy Act 1988 and Australian Privacy Principles, or if you operate in a regulated sector — financial services, health, legal — sending that data to a US-based API endpoint creates jurisdictional risk that your legal team will rightly flag.

Propriety API vendors have improved their data handling commitments significantly (Azure OpenAI, for example, offers Australian region endpoints and contractual data residency commitments), but not every team has the procurement leverage or legal bandwidth to negotiate those terms. Self-hosting on your own cloud tenancy in ap-southeast-2 (Sydney) or ap-southeast-4 (Melbourne) removes the question entirely.

Your call volume is high and the unit economics have inverted

Proprietary APIs price on tokens. At low volume, the operational simplicity of an API is worth the per-token cost. As volume scales, the economics shift. If your application is making millions of inference calls per day — a document processing pipeline, a high-traffic product feature, a customer-facing assistant — the token cost of a proprietary model can become a significant line item.

Self-hosting transfers that variable cost into a fixed infrastructure cost (compute, GPU capacity, engineering time to run it). Whether that trade is favourable depends on your specific volume, your GPU procurement strategy, and whether your team has the MLOps capacity to run the stack reliably. It is not automatically cheaper — but at sufficient scale, the crossover point is real.

You need to customise the model

Fine-tuning a proprietary model is either unavailable, expensive, or constrained by the vendor's terms. If your use case requires a model adapted to your domain — a specific legal corpus, your internal product knowledge, a particular writing style or schema — fine-tuning an open-weight model on your own data gives you full control over the process and the output.

This also applies to inference-time customisation: system prompts, constrained generation, structured output schemas. Open-weight models running locally or on your infrastructure let you modify the serving layer in ways a black-box API does not.

You have a specific latency or offline requirement

If your application needs to run in a latency-sensitive context (real-time audio, edge devices, on-premise deployments with restricted internet access), you cannot depend on a round-trip to an external API. Self-hosted models can run closer to the data, on-premise, or in air-gapped environments where API calls are simply not possible.

When Proprietary APIs Are the Better Choice

You are in early exploration or prototyping

The fastest path from idea to working prototype is a proprietary API. No infrastructure to provision, no model to serve, no GPU to configure. For most Australian teams validating whether an AI feature is worth building, starting with a proprietary API is the right call. The goal at that stage is learning, not optimising unit economics.

Once you have validated the use case and understand the call volume, you can revisit the build vs. buy decision with real data.

Task complexity requires frontier capability

Open-weight models have improved dramatically. For well-defined tasks — classification, extraction, summarisation, moderate-complexity generation — current Llama and Mistral variants perform competitively with proprietary models. For tasks requiring deep reasoning, complex multi-step planning, or nuanced instruction following across long contexts, frontier proprietary models still hold a meaningful capability lead.

Be honest about where your task sits on this spectrum. Self-hosting a smaller open-weight model to save on API costs, then finding it cannot handle your use case reliably, is not a saving — it is a delay.

Your team lacks the MLOps capacity

Running a self-hosted LLM in production is not trivial. You need to manage model serving infrastructure (vLLM, Ollama, TGI, or similar), GPU capacity, scaling behaviour under load, model versioning, monitoring for quality degradation, and security hardening. For a small engineering team already stretched across core product work, that operational surface is significant.

A proprietary API abstracts all of that. If your team does not have — or cannot hire — the MLOps depth to run the stack well, the "savings" of self-hosting will be consumed by operational incidents and engineering time.

You need multimodal or rapidly evolving capability

If your roadmap depends on capabilities that are evolving quickly — vision, code generation, agentic tool use — proprietary vendors are generally iterating faster than the open-weight ecosystem. You get access to new capabilities via an API version bump, not a re-deployment cycle.

A Practical Comparison Framework

Dimension	Proprietary API	Open-Weight Self-Hosted
Data residency	Depends on vendor and contract	Full control — host in AU region
Entry cost	Low — pay per token, no infra	Higher — GPU provisioning, ops setup
Cost at scale	Increases linearly with volume	Fixed infra cost; lower marginal cost
Capability (frontier tasks)	Higher — GPT-4, Claude lead	Lower — gap is narrowing
Customisation (fine-tuning)	Limited or vendor-controlled	Full control
Operational burden	Low — vendor managed	High — team owned
Latency / offline	Network-dependent	Can run on-premise or edge
Vendor lock-in	Higher	Lower — weights are portable
Time to first prototype	Fast	Slower

Qualitative ratings only — actual outcomes depend on your stack, team, and use case.

The Australian Mid-Market Reality

Most Australian mid-market technology teams — a SaaS company at 200 employees, a fintech scaling into enterprise, a healthtech with real data sensitivity — sit in a middle ground. They have genuine data residency concerns. They are not at the token volume where self-hosting obviously pencils out. And they do not have a dedicated MLOps engineer.

The pragmatic path for most of these teams is a hybrid model:

Use proprietary APIs for prototyping, high-complexity tasks, and features where capability matters more than cost.
Self-host open-weight models for high-volume, well-defined tasks where the model is "good enough" and data residency or cost are constraints.
Architect the application layer so the underlying model is swappable — avoid tight coupling to any single vendor's API surface.

That last point is underappreciated. The teams that get the most flexibility are the ones that design for model portability from the start, rather than refactoring after they've shipped.

If you're thinking through this architecture, our AI engineering work often starts with exactly this question: what belongs behind a proprietary API, what should run on your own infrastructure, and how do you build a serving layer that doesn't lock you in.

What Does Self-Hosting Actually Involve?

For teams considering self-hosting for the first time, it helps to be concrete about what's involved:

Infrastructure: You need GPU-capable compute. For most inference workloads, this means cloud instances with NVIDIA A10, A100, or H100 GPUs, or purpose-built GPU servers on-premise. AWS, Azure, and GCP all have relevant instance types available in Australian regions.

Model serving: Tools like vLLM, Text Generation Inference (TGI), and Ollama handle the serving layer. They manage batching, memory, and expose an API endpoint your application can call.

Model selection and quantisation: Smaller quantised models (4-bit or 8-bit versions) run on less GPU memory with modest quality tradeoffs. A 7B or 13B parameter model in a quantised format can run on a single GPU that costs a fraction of what a full-precision 70B model requires.

Monitoring and reliability: You own the uptime. You need to monitor for hardware failures, memory pressure, and model quality drift — none of which a proprietary API makes you think about.

Security: The model serving endpoint needs to be secured. If it's exposed on your network, it needs authentication and access control. An unsecured inference endpoint is a real attack surface.

This is not insurmountable, but it is not trivial. Teams that underestimate the operational surface tend to regret the choice — not because self-hosting is wrong, but because they weren't staffed for it.

Licensing: Read Before You Deploy

Not all open-weight models are free to use in production. Llama models carry Meta's community licence, which restricts commercial use for applications exceeding a certain monthly active user threshold. Mistral's models have varied licensing terms depending on the release.

Before you deploy an open-weight model for a commercial application, review the licence. This is not a bureaucratic formality — it is a real legal consideration that affects whether self-hosting is viable for your specific use case.

Making the Call

The open-source vs. proprietary question does not have a universal answer. It has a contextual answer that depends on your data sensitivity, your call volume, your team's MLOps capacity, and where your task sits on the capability spectrum.

The teams that make this decision well start with a clear view of those four dimensions before committing to infrastructure. They prototype on a proprietary API, validate the use case, then assess whether self-hosting is warranted once they have real usage data.

If your organisation is building an AI roadmap that touches these decisions — which model to use, where to run it, how to structure the serving layer — that's a question our AI product strategy work is designed to help with. We don't have a vendor preference. We help you make the call that fits your actual constraints.

You can also explore more posts on AI architecture and engineering decisions in our insights.

If you're working through the self-hosting decision for a specific application — or trying to figure out whether your current API costs justify a migration — start a conversation with our team. We're happy to look at your use case and give you a straight answer.

LLM AI engineering Self-Hosting Open Source AI AI infrastructure

Chris Kerr

Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.

16 July 2026

Application Modernisation in Australia: The Complete 2025 Guide

A practical guide to application modernisation for Australian engineering leaders — covering patterns like strangler fig and re-architecture, architecture maturity trade-offs, and Australian-specific context including the Essential Eight and Hosting Certification Framework.

6 min readChris Kerr

14 July 2026

Planning an AI Engagement: What Production Delivery Requires

Before committing budget to an AI initiative, it's worth agreeing on what production-grade delivery actually means. This guide covers the standards worth setting for any AI engagement — from production track record to MLOps planning and IP ownership.

6 min readChris Kerr

8 July 2026

AI for Australian Manufacturing: 5 Use Cases That Work

Australian manufacturers are deploying production AI across five use cases today: predictive maintenance, computer vision quality inspection, document AI for compliance, demand forecasting, and procurement automation. This practitioner overview covers what makes each use case work in production — and where each one fails — for CTOs and engineering leaders evaluating where to start.

9 min readChris Kerr

Open-Source vs Proprietary LLMs: When to Self-Host