Open-Source vs Proprietary LLMs: When to Self-Host
Self-hosting open-weight models like Llama or Mistral gives Australian teams data residency, cost control at scale, and deployment flexibility — but it comes with real infrastructure and operational tradeoffs. This post works through when each approach makes sense, with a practical framework for mid-market teams deciding between open-weight self-hosting and proprietary APIs like GPT-4 or Claude.

Open-Source vs Proprietary LLMs: When to Self-Host
The decision to self-host an open-weight model like Llama or Mistral — rather than call a proprietary API like GPT-4 or Claude — is one of the most consequential infrastructure choices an AI engineering team can make. Get it right and you get cost control, data residency, and deployment flexibility. Get it wrong and you get an underpowered model, a GPU bill you didn't budget for, and an ops burden your team wasn't staffed to carry.
This post works through the real tradeoffs. No benchmarks cherry-picked to flatter a preference. Just an honest account of when each path makes sense for Australian mid-market teams.
What Are Open-Weight Models?
An open-weight model is a large language model whose trained parameters are publicly released, allowing anyone to download, run, and modify the model on their own infrastructure. Llama (Meta) and Mistral are the most widely deployed examples. "Open-weight" is more precise than "open-source" because the training data and some licensing terms vary — but in practice, these models can be self-hosted without paying a per-token fee to a vendor.

Proprietary models — GPT-4 (OpenAI), Claude (Anthropic), Gemini (Google) — are accessed exclusively via API. You send a request, the vendor's infrastructure runs inference, you receive a response. You never hold the weights.
The Core Tradeoff in One Sentence
Proprietary APIs give you the best available capability at the lowest entry cost; open-weight self-hosting gives you control, data residency, and lower marginal cost at scale — but requires you to own the infrastructure and accept a capability gap on the most demanding tasks.

When Self-Hosting Open-Weight Models Makes Sense
Your data cannot leave your environment
Data residency is the clearest argument for self-hosting. If your application processes personal information governed by the Privacy Act 1988 and Australian Privacy Principles, or if you operate in a regulated sector — financial services, health, legal — sending that data to a US-based API endpoint creates jurisdictional risk that your legal team will rightly flag.
Propriety API vendors have improved their data handling commitments significantly (Azure OpenAI, for example, offers Australian region endpoints and contractual data residency commitments), but not every team has the procurement leverage or legal bandwidth to negotiate those terms. Self-hosting on your own cloud tenancy in ap-southeast-2 (Sydney) or ap-southeast-4 (Melbourne) removes the question entirely.
Your call volume is high and the unit economics have inverted
Proprietary APIs price on tokens. At low volume, the operational simplicity of an API is worth the per-token cost. As volume scales, the economics shift. If your application is making millions of inference calls per day — a document processing pipeline, a high-traffic product feature, a customer-facing assistant — the token cost of a proprietary model can become a significant line item.
Self-hosting transfers that variable cost into a fixed infrastructure cost (compute, GPU capacity, engineering time to run it). Whether that trade is favourable depends on your specific volume, your GPU procurement strategy, and whether your team has the MLOps capacity to run the stack reliably. It is not automatically cheaper — but at sufficient scale, the crossover point is real.
You need to customise the model
Fine-tuning a proprietary model is either unavailable, expensive, or constrained by the vendor's terms. If your use case requires a model adapted to your domain — a specific legal corpus, your internal product knowledge, a particular writing style or schema — fine-tuning an open-weight model on your own data gives you full control over the process and the output.
This also applies to inference-time customisation: system prompts, constrained generation, structured output schemas. Open-weight models running locally or on your infrastructure let you modify the serving layer in ways a black-box API does not.
You have a specific latency or offline requirement
If your application needs to run in a latency-sensitive context (real-time audio, edge devices, on-premise deployments with restricted internet access), you cannot depend on a round-trip to an external API. Self-hosted models can run closer to the data, on-premise, or in air-gapped environments where API calls are simply not possible.
When Proprietary APIs Are the Better Choice
You are in early exploration or prototyping
The fastest path from idea to working prototype is a proprietary API. No infrastructure to provision, no model to serve, no GPU to configure. For most Australian teams validating whether an AI feature is worth building, starting with a proprietary API is the right call. The goal at that stage is learning, not optimising unit economics.
Once you have validated the use case and understand the call volume, you can revisit the build vs. buy decision with real data.
Task complexity requires frontier capability
Open-weight models have improved dramatically. For well-defined tasks — classification, extraction, summarisation, moderate-complexity generation — current Llama and Mistral variants perform competitively with proprietary models. For tasks requiring deep reasoning, complex multi-step planning, or nuanced instruction following across long contexts, frontier proprietary models still hold a meaningful capability lead.
Be honest about where your task sits on this spectrum. Self-hosting a smaller open-weight model to save on API costs, then finding it cannot handle your use case reliably, is not a saving — it is a delay.
Your team lacks the MLOps capacity
Running a self-hosted LLM in production is not trivial. You need to manage model serving infrastructure (vLLM, Ollama, TGI, or similar), GPU capacity, scaling behaviour under load, model versioning, monitoring for quality degradation, and security hardening. For a small engineering team already stretched across core product work, that operational surface is significant.
A proprietary API abstracts all of that. If your team does not have — or cannot hire — the MLOps depth to run the stack well, the "savings" of self-hosting will be consumed by operational incidents and engineering time.
You need multimodal or rapidly evolving capability
If your roadmap depends on capabilities that are evolving quickly — vision, code generation, agentic tool use — proprietary vendors are generally iterating faster than the open-weight ecosystem. You get access to new capabilities via an API version bump, not a re-deployment cycle.
A Practical Comparison Framework
| Dimension | Proprietary API | Open-Weight Self-Hosted |
|---|---|---|
| Data residency | Depends on vendor and contract | Full control — host in AU region |
| Entry cost | Low — pay per token, no infra | Higher — GPU provisioning, ops setup |
| Cost at scale | Increases linearly with volume | Fixed infra cost; lower marginal cost |
| Capability (frontier tasks) | Higher — GPT-4, Claude lead | Lower — gap is narrowing |
| Customisation (fine-tuning) | Limited or vendor-controlled | Full control |
| Operational burden | Low — vendor managed | High — team owned |
| Latency / offline | Network-dependent | Can run on-premise or edge |
| Vendor lock-in | Higher | Lower — weights are portable |
| Time to first prototype | Fast | Slower |
Qualitative ratings only — actual outcomes depend on your stack, team, and use case.
The Australian Mid-Market Reality
Most Australian mid-market technology teams — a SaaS company at 200 employees, a fintech scaling into enterprise, a healthtech with real data sensitivity — sit in a middle ground. They have genuine data residency concerns. They are not at the token volume where self-hosting obviously pencils out. And they do not have a dedicated MLOps engineer.
The pragmatic path for most of these teams is a hybrid model:
- Use proprietary APIs for prototyping, high-complexity tasks, and features where capability matters more than cost.
- Self-host open-weight models for high-volume, well-defined tasks where the model is "good enough" and data residency or cost are constraints.
- Architect the application layer so the underlying model is swappable — avoid tight coupling to any single vendor's API surface.
That last point is underappreciated. The teams that get the most flexibility are the ones that design for model portability from the start, rather than refactoring after they've shipped.
If you're thinking through this architecture, our AI engineering work often starts with exactly this question: what belongs behind a proprietary API, what should run on your own infrastructure, and how do you build a serving layer that doesn't lock you in.
What Does Self-Hosting Actually Involve?
For teams considering self-hosting for the first time, it helps to be concrete about what's involved:
Infrastructure: You need GPU-capable compute. For most inference workloads, this means cloud instances with NVIDIA A10, A100, or H100 GPUs, or purpose-built GPU servers on-premise. AWS, Azure, and GCP all have relevant instance types available in Australian regions.
Model serving: Tools like vLLM, Text Generation Inference (TGI), and Ollama handle the serving layer. They manage batching, memory, and expose an API endpoint your application can call.
Model selection and quantisation: Smaller quantised models (4-bit or 8-bit versions) run on less GPU memory with modest quality tradeoffs. A 7B or 13B parameter model in a quantised format can run on a single GPU that costs a fraction of what a full-precision 70B model requires.
Monitoring and reliability: You own the uptime. You need to monitor for hardware failures, memory pressure, and model quality drift — none of which a proprietary API makes you think about.
Security: The model serving endpoint needs to be secured. If it's exposed on your network, it needs authentication and access control. An unsecured inference endpoint is a real attack surface.
This is not insurmountable, but it is not trivial. Teams that underestimate the operational surface tend to regret the choice — not because self-hosting is wrong, but because they weren't staffed for it.
Licensing: Read Before You Deploy
Not all open-weight models are free to use in production. Llama models carry Meta's community licence, which restricts commercial use for applications exceeding a certain monthly active user threshold. Mistral's models have varied licensing terms depending on the release.
Before you deploy an open-weight model for a commercial application, review the licence. This is not a bureaucratic formality — it is a real legal consideration that affects whether self-hosting is viable for your specific use case.
Making the Call
The open-source vs. proprietary question does not have a universal answer. It has a contextual answer that depends on your data sensitivity, your call volume, your team's MLOps capacity, and where your task sits on the capability spectrum.
The teams that make this decision well start with a clear view of those four dimensions before committing to infrastructure. They prototype on a proprietary API, validate the use case, then assess whether self-hosting is warranted once they have real usage data.
If your organisation is building an AI roadmap that touches these decisions — which model to use, where to run it, how to structure the serving layer — that's a question our AI product strategy work is designed to help with. We don't have a vendor preference. We help you make the call that fits your actual constraints.
You can also explore more posts on AI architecture and engineering decisions in our insights.
If you're working through the self-hosting decision for a specific application — or trying to figure out whether your current API costs justify a migration — start a conversation with our team. We're happy to look at your use case and give you a straight answer.
Chris Kerr
Founder of Horizon Labs. Twenty years building production software for Australian mid-market businesses, the last seven focused on putting AI into systems that operate at 3am without anyone watching. Writes about strategy, fractional CTO work, and the operational discipline that separates AI demos from AI products.


