GPU Compute (A100 / H100)
Most production AI doesn't need dedicated GPU infrastructure: hosted APIs handle inference fine. But when GPU compute is the right answer, provisioning it badly is expensive, because idle GPU hours can cost more per month than an entire team. We provision GPU compute on AWS (P5 instances), GCP (A3), Azure (NDv5), or specialised providers (Lambda, CoreWeave) based on workload economics: sustained-utilisation workloads tip toward dedicated infrastructure, bursty training favours spot instances, and experimental workflows go to lower-cost hosted providers. The goal is always utilisation: a GPU running at 80% justifies its cost; one running at 15% doesn't.
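To make the utilisation point concrete, here is a minimal sketch that divides an hourly rate by utilisation to get the cost of one hour of useful GPU work. The $4.00 rate is a hypothetical placeholder, not a quoted price, and the function name is ours.

```python
def effective_hourly_cost(hourly_rate: float, utilisation: float) -> float:
    """Cost of one hour of useful GPU work at a given utilisation in (0, 1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_rate / utilisation


# Hypothetical rate in USD per GPU-hour, for illustration only.
HYPOTHETICAL_RATE = 4.00

for utilisation in (0.80, 0.15):
    cost = effective_hourly_cost(HYPOTHETICAL_RATE, utilisation)
    print(f"{utilisation:.0%} utilised: ${cost:.2f} per useful GPU-hour")
```

Whatever the rate, a GPU at 15% utilisation costs more than five times as much per useful hour as the same GPU at 80%.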
What you get
Real examples
Self-hosted Llama 70B inference on H100s
Illustrative scenario: a sovereign-data client needs a Llama 70B model running in their VPC. We provision dual H100s on AWS P5 with vLLM as the inference server, autoscaling between 1 and 4 GPUs based on request volume. Utilisation is sustained at 60-70%, and the setup is cost-effective above ~3M inference tokens per day.
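The scaling rule in this scenario can be sketched as a simple clamp. This is an illustration of the idea only, not the autoscaler configuration used in any deployment; `rps_per_gpu` (requests per second one GPU can serve) is a hypothetical capacity figure you would measure for your own model.

```python
import math


def desired_gpus(requests_per_second: float, rps_per_gpu: float,
                 min_gpus: int = 1, max_gpus: int = 4) -> int:
    """Return the GPU count for the current load, clamped to [min_gpus, max_gpus]."""
    if rps_per_gpu <= 0:
        raise ValueError("rps_per_gpu must be positive")
    needed = math.ceil(requests_per_second / rps_per_gpu)
    return max(min_gpus, min(max_gpus, needed))


# Hypothetical capacity: one GPU serves 10 requests per second.
for load in (0, 25, 100):
    print(load, "req/s ->", desired_gpus(load, rps_per_gpu=10), "GPU(s)")
```

The floor of one GPU keeps the endpoint warm when traffic is quiet, and the ceiling caps spend during spikes.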
Spot training for a custom recommendation model
Illustrative scenario: a media company trains a recommendation model weekly on their full catalogue. AWS Spot A100s + SageMaker handle the training; W&B tracks runs; checkpoint resumption handles spot interruptions. Cost per training run drops to ~35% of on-demand pricing.
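Checkpoint resumption is what makes spot interruptions tolerable. The sketch below is a toy loop using only the standard library: it assumes a JSON file holding just a step counter, whereas a real job would save model and optimiser state, typically every N steps rather than every step. The `interrupt_at` parameter is a hypothetical stand-in for the provider reclaiming the instance.

```python
import json
import os
import tempfile
from typing import Optional


def train(total_steps: int, ckpt_path: str,
          interrupt_at: Optional[int] = None) -> int:
    """Run a toy training loop, resuming from ckpt_path if it exists.

    Returns the last completed step.
    """
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]  # resume where the last run stopped

    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return step  # simulated spot interruption
        step += 1  # stand-in for one real optimisation step
        # Write to a temp file, then rename, so an interruption mid-write
        # never leaves a corrupt checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": step}, f)
        os.replace(tmp_path, ckpt_path)
    return step


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "ckpt.json")
    first = train(10, path, interrupt_at=4)  # interrupted part-way through
    second = train(10, path)                 # new instance picks up the checkpoint
    print(first, second)
```

The second call does not repeat the completed steps, so an interruption costs only the work done since the last checkpoint.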
Hybrid GPU strategy across experimentation + production
Illustrative scenario: a research-led AI startup needs cheap GPU for ad-hoc experimentation but reliable infrastructure for production. We split: Lambda / RunPod for daily experimentation (low ops, low cost), AWS P5 reserved instances for production serving (high reliability, predictable cost).
Common questions
When do we actually need dedicated GPU compute?
Three cases. Self-hosted LLM inference at high volume (above ~$5K/month spent on hosted APIs typically tips toward self-hosting). Custom model training (basically always GPU). Cases where data can't leave a sovereign boundary. Outside those, hosted APIs are operationally cheaper and we recommend them.
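The hosted-versus-self-hosted threshold is a break-even comparison. Here is a rough sketch under stated assumptions: the GPU rate and the monthly ops overhead below are hypothetical placeholders, and 730 is the average number of hours in a month for an always-on instance.

```python
def monthly_self_hosted_cost(gpu_count: int, hourly_rate: float,
                             ops_overhead: float,
                             hours_per_month: float = 730) -> float:
    """Monthly cost of always-on GPUs plus a flat operations overhead."""
    return gpu_count * hourly_rate * hours_per_month + ops_overhead


def cheaper_option(hosted_api_spend: float, self_hosted_cost: float) -> str:
    """Name the cheaper option for a given monthly hosted-API bill."""
    return "self-hosted" if self_hosted_cost < hosted_api_spend else "hosted API"


# Hypothetical inputs: 2 GPUs at $2.50/hour plus $1,000/month of ops time.
self_hosted = monthly_self_hosted_cost(gpu_count=2, hourly_rate=2.50,
                                       ops_overhead=1000)
for api_spend in (2000, 5000):
    print(f"${api_spend}/month on APIs ->", cheaper_option(api_spend, self_hosted))
```

The ops overhead term matters: leaving it out makes self-hosting look cheaper than it is.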
A100 or H100?
A100 for most production fine-tuning and inference of 7-70B models — cheaper, widely available, mature. H100 when the workload genuinely needs the extra throughput (frontier model training, 100B+ inference, very high request rates). The H100 premium is rarely worth it for the workloads we typically run.
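One way to test whether the H100 premium pays off is to compare cost per million tokens rather than cost per hour: the faster card wins only if its throughput gain exceeds its price premium. The rates and throughputs below are hypothetical placeholders, so substitute your own benchmark numbers.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float) -> float:
    """Serving cost per one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate / tokens_per_hour * 1_000_000


# Hypothetical (hourly rate in USD, tokens per second) for each card.
candidates = {
    "A100": (2.00, 1000.0),
    "H100": (4.00, 1600.0),
}

for name, (rate, throughput) in candidates.items():
    print(f"{name}: ${cost_per_million_tokens(rate, throughput):.3f} per 1M tokens")
```

With these made-up inputs the H100 costs twice as much per hour but delivers only 1.6x the throughput, so the A100 is cheaper per token.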
Spot vs reserved vs on-demand?
Spot for any training where interruption is acceptable — usually all of it with proper checkpointing. Reserved for production serving with predictable utilisation — discount of 30-60% justifies the lock-in. On-demand for short-duration burst workloads or experimentation. Most projects end up mixing all three.
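The reserved-versus-on-demand trade-off reduces to one inequality if we assume a reserved instance is billed for every hour whether used or not: reserved is cheaper once the fraction of hours you actually use exceeds one minus the discount. A minimal sketch of that arithmetic, with function names of our own choosing:

```python
def reserved_break_even_utilisation(discount: float) -> float:
    """Fraction of hours in use above which reserved beats on-demand.

    Assumes reserved capacity is billed for all hours at (1 - discount)
    times the on-demand rate, and on-demand is billed only for hours used.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def reserved_is_cheaper(usage_fraction: float, discount: float) -> bool:
    """True when reserved pricing costs less than on-demand for this usage."""
    return usage_fraction > reserved_break_even_utilisation(discount)


# The 30-60% discount range from the text above.
for discount in (0.30, 0.60):
    threshold = reserved_break_even_utilisation(discount)
    print(f"{discount:.0%} discount: reserved wins above {threshold:.0%} usage")
```

Across the 30-60% discount range, the break-even sits between 70% and 40% of hours in use, which is why reserved suits steady production serving and not bursty work.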
Australian-region availability?
Limited and expensive. AWS Sydney has A100s (P4d) but waitlists are real; H100s rare. GCP Sydney has A100s on A2 instances. For training, we often route to US regions (cheaper, more available) since training jobs don't have user-facing latency. For inference with AU customers, Sydney + Singapore is the typical routing — adds latency but keeps data residency in-region.
Specialised providers — Lambda, CoreWeave, RunPod — are they production-ready?
Lambda and CoreWeave: yes, for production at smaller scale; both have solid uptime and support. RunPod and Vast.ai are great for experimentation, but we wouldn't put production serving on them. Bigger picture: specialised providers can be 30-50% cheaper than hyperscalers but trade off tooling integration, compliance certifications, and enterprise support, so match the choice to the workload.