Meta Llama
Llama is the open-source LLM we deploy when hosted APIs can't be used — data sovereignty, strict residency, full model ownership, or fine-tuning at a scale that hosted fine-tuning doesn't support. The Llama 4 family (Scout and Maverick) are credible alternatives to Claude Sonnet for many workloads (RAG, classification, structured extraction). The trade-off is operational: you take on the GPU infrastructure, the inference serving, and the monitoring that hosted APIs handle for you. We recommend Llama selectively — when sovereignty matters more than ops cost, or when fine-tuning is genuinely the right tool.
What you get
Real examples
On-premises deployment for sovereign data
Illustrative scenario: an Australian federal-government-adjacent client requires AI inference within their own data centre — no cloud APIs, no outbound traffic. A Llama 4 model deployed on dedicated H100 hardware, served via vLLM, integrated with the client's existing identity stack.
Domain fine-tuning at scale
Illustrative scenario: a legal-tech company needs an LLM tuned to Australian case law and contract language. We fine-tune a Llama 4 model on a curated corpus of judgments and contract templates. Deployed model outperforms general-purpose Claude on the client's specific evaluation suite.
Hybrid open + hosted architecture
Illustrative scenario: a healthcare client uses Llama for patient-data inference (sovereign requirement) and Claude for general business workflows (productivity). Same product, two model backends, routed by data sensitivity classification at the application layer.
Common questions
When is Llama the right call over Claude or GPT?
Three scenarios. One, hard sovereignty requirements where no inference can leave your boundary. Two, fine-tuning at a scale or specificity that hosted APIs don't support. Three, very high-volume workloads where GPU utilisation can be kept high enough that self-hosted Llama beats hosted-API cost per token. For most other workloads, hosted Claude or Gemini is operationally cheaper.
What's the operational overhead of running Llama?
Real but manageable. You need GPU infrastructure (A100s or H100s, owned or rented), an inference server (we default to vLLM), monitoring, autoscaling, and a deployment story. We typically deploy on AWS Bedrock or Azure AI Foundry first as the managed path, then move to dedicated infrastructure if cost or sovereignty demands it.
Llama 8B, 70B, or 405B?
70B is the sweet spot for production for most clients — comparable quality to Claude Sonnet on many tasks, runs on a single H100 with quantisation. 8B for edge / low-latency embedded use cases. 405B almost never — the operational cost rarely justifies the marginal quality gain over 70B for the workloads we ship.
How do you fine-tune Llama responsibly?
LoRA or QLoRA, not full-weight fine-tuning unless we have a specific reason. Always against a held-out evaluation set with measurable accuracy gains. Always with regression testing of safety behaviours — fine-tuning can degrade refusal patterns. Always documented with the training data + hyperparameters so the result is reproducible.
Can we migrate from a hosted API to Llama later?
Yes, this is a common path. Prototype on Claude or GPT, validate the product-market fit, then migrate to Llama once volume + sovereignty pressures justify the ops overhead. Because our abstraction layer is model-agnostic, this migration is typically a few weeks for the swap + evaluation harness, not a rebuild.
Ready to get started?
Tell us about your project and we'll tell you honestly how we can help.
Get in Touch