Hugging Face
Hugging Face is the open-source AI ecosystem's gravity well, and our default place to find models, datasets, and tooling when we step away from hosted LLMs. We use the Transformers library for self-hosted inference of Llama and Mistral variants, the Hub for model selection (especially embeddings and classification models), the Datasets library for fine-tuning corpora, and PEFT / TRL for the actual fine-tuning workflows. Hugging Face Inference Endpoints get a more measured recommendation — useful for prototyping but the production economics rarely beat AWS Bedrock or self-hosted vLLM at scale.
What you get
Real examples
Embedding model selection for RAG
Illustrative scenario: a healthcare client's RAG system shows poor retrieval quality. We benchmark 5 embedding models from Hugging Face (BGE, E5, Cohere, OpenAI, plus a domain-tuned one) on the client's actual query set. BGE-large-en wins by ~12% recall@10. Swap in, RAG quality improves measurably without changing the LLM.
Domain-tuned classifier for compliance triage
Illustrative scenario: an insurer needs higher accuracy than a general-purpose frontier model on insurance-specific document classification. We fine-tune distilBERT on Hugging Face Transformers with their labelled training data. Resulting model is 30× faster + 50× cheaper to run than a frontier LLM with better accuracy on the specific taxonomy.
Self-hosted Llama via Transformers + vLLM
Illustrative scenario: a sovereign-data client needs a Llama 4 model running in their VPC. Transformers handles model loading + tokenisation; vLLM provides the high-throughput inference server. Hugging Face Hub provides model weights + the deployment recipe.
Common questions
Why does Hugging Face matter when hosted LLMs exist?
Three things hosted LLMs can't do well. One, embedding model selection — RAG quality depends massively on the right embedding model, and the best ones are open-source on Hugging Face. Two, fine-tuning at depth — hosted fine-tuning is limited. Three, sovereign deployment — when the data can't leave your boundary, Hugging Face is where the deployable models live.
Hugging Face Inference Endpoints vs AWS Bedrock?
We default to Bedrock for production. Inference Endpoints are convenient for prototyping but the per-token economics and autoscaling story rarely beat Bedrock at scale for the same open-source model. Self-hosted via vLLM beats both on cost above ~$5K/month inference spend.
Which embedding model do you use for RAG?
Depends on the corpus + query distribution. Default to OpenAI text-embedding-3-large for fast prototyping; benchmark BGE-large, E5, Cohere, and any domain-specific options before production commitment. We've seen 10-20% retrieval-quality differences across models for the same RAG system — embedding choice matters more than most teams realise.
Do you fine-tune embedding models?
Rarely, but it's a real lever. When out-of-the-box embedding models underperform on a specific domain (medical, legal, hyper-technical), sentence-transformers + a labelled query set can produce a domain-tuned embedding model with measurable retrieval gains. Costs a day or two of engineering plus a small GPU bill.
How does the Hugging Face Hub fit your workflow?
We use it as the canonical model registry for open-source models — pinning to specific Hub revisions ensures reproducibility. We rarely publish public models to the Hub (clients prefer keeping fine-tuned weights private), but we do use private organisation repos for model artefact storage on a per-client basis.
Ready to get started?
Tell us about your project and we'll tell you honestly how we can help.
Get in Touch