Horizon LabsHorizon Labs
Back to Insights
14 June 2026Updated 14 June 202612 min read

Data Infrastructure Consulting Australia: Building the Foundation for AI

Most growing Australian organisations don't have a data problem — they have a data infrastructure problem. This post covers the modern data stack (Snowflake, BigQuery, Databricks), ELT vs ETL trade-offs, data contract patterns, and Australian data sovereignty considerations, with a practical look at consolidating five warehouses into one.

Data Infrastructure Consulting Australia: Building the Foundation for AI

Most growing Australian organisations don't have a data problem — they have a data infrastructure problem. The data exists. It's scattered across five warehouses, three SaaS tools, two legacy databases, and a spreadsheet someone emailed last quarter. Before AI can create value, that foundation needs to be right. This is the core of what data infrastructure consulting actually delivers: not dashboards, not models — a reliable, queryable, trustworthy base that everything else can build on.


What Is Data Infrastructure and Why Does It Come Before AI?

Data infrastructure is the set of systems, pipelines, and patterns that move raw data from source systems into a form your organisation can query, trust, and act on. It includes ingestion pipelines, storage and compute platforms, transformation logic, data quality checks, access controls, and the governance layer that makes all of it auditable.

AI and machine learning models are only as good as the data they consume. A recommendation engine trained on inconsistent product data returns poor recommendations. A churn model built on incomplete customer records misses the customers most at risk. The infrastructure layer is not a prerequisite that gets in the way of AI — it is the reason AI works at all.

For data leaders at growing Australian companies, this creates a sequencing challenge: the business wants AI outcomes now, but the data foundations aren't ready to support them. The right answer is rarely "wait." It's "build the foundations in parallel with a defined AI use case so the investment has a clear payoff."


A Common Scenario: Five Warehouses, One Team, Zero Clarity

Consider a 200-person Australian SaaS company that has grown through a combination of product expansion and two small acquisitions. They now have:

A male data engineer looks stressed at a desk with three monitors showing different data tools and conflicting figures, while colleagues discuss a chart in the background of an Australian tech office.

  • A Redshift cluster from the original data team
  • A Google BigQuery project set up by the product team for event analytics
  • A SQL Server database inherited from an acquired company
  • A Salesforce-native reporting environment managed by RevOps
  • A Power BI dataset someone built manually from CSV exports

Every team has their own numbers. The sales team's revenue figure doesn't match the finance team's. Churn is calculated three different ways depending on who you ask. The data engineering team — two people — spends most of their time answering questions and reconciling discrepancies rather than building.

This organisation is not unusual. It's representative of what happens when data tooling follows team decisions rather than a platform strategy. Consolidating these five sources into a single, well-modelled warehouse isn't a cleanup exercise — it's the precondition for every AI initiative on the roadmap.

The consolidation project typically involves three parallel workstreams: choosing a target platform, defining a data modelling standard, and establishing ingestion and transformation pipelines that all source systems flow into. Each of these has real architectural decisions attached to it.


Choosing a Modern Data Platform: Snowflake, BigQuery, and Databricks

Three platforms dominate the modern Australian enterprise data stack. They are meaningfully different — in architecture, pricing model, ecosystem fit, and the kinds of workloads they handle well.

Snowflake

Snowflake is a cloud-native data warehouse with a separation of storage and compute. You pay for storage and for compute credits used during queries. It's platform-agnostic — it runs on AWS, Azure, or GCP — which matters if your organisation hasn't committed to a single cloud provider. Snowflake's data sharing capabilities are genuinely strong: you can share live data with external partners without copying it.

It's a good fit for organisations that want a clean, SQL-first warehouse with strong governance features, a rich ecosystem of connectors, and predictable operational overhead. The per-credit pricing model can escalate with heavy concurrent query workloads, so cost management requires attention.

Google BigQuery

BigQuery is Google Cloud's serverless analytics warehouse. There's no infrastructure to manage — you query data and pay for the bytes scanned (or use flat-rate pricing for predictable cost). It handles very large query volumes well and integrates tightly with the Google Cloud ecosystem, including Vertex AI for ML workloads.

BigQuery is a natural choice if you're already on GCP, or if your team does significant event-level analytics at scale. The serverless model reduces operational overhead considerably. The trade-off is that you're more tightly coupled to the Google Cloud ecosystem.

Databricks

Databricks is built on Apache Spark and is a unified data and AI platform — it handles both large-scale data engineering and machine learning workloads in the same environment. It's less purely a warehouse and more a lakehouse platform: you can store raw data in open formats (Delta Lake) alongside structured, curated layers.

Databricks is the strongest choice when your workloads include significant ML training, feature engineering, or real-time streaming — not just batch analytics. It has higher operational complexity than Snowflake or BigQuery, so it makes more sense when you have data engineers and ML engineers who will actually use the advanced capabilities.

A Qualitative Comparison

DimensionSnowflakeBigQueryDatabricks
Primary use caseSQL analytics, data sharingSQL analytics at scaleData engineering + ML
ML / AI workloadsVia partner integrationsVia Vertex AINative, strong
Cloud flexibilityMulti-cloudGCP-nativeMulti-cloud
Operational complexityLowVery lowMedium–High
Open format storageLimitedLimitedStrong (Delta Lake)
Best fitSQL-first teams, governance focusGCP-native, event analyticsUnified data + AI teams

For the 200-person scenario above — consolidating five sources, two engineers, SQL-first team — Snowflake or BigQuery are typically the right starting points. Databricks becomes relevant once the ML workloads justify the complexity overhead.


ELT vs ETL: Which Pattern Fits Your Stack?

ELT (Extract, Load, Transform) and ETL (Extract, Transform, Load) describe the order in which raw data is processed before it becomes available for analytics.

ETL transforms data before loading it into the warehouse. The transformation logic lives in an intermediate layer — often a custom pipeline or a legacy tool like Informatica or SSIS. This made sense when warehouse storage and compute were expensive: you only loaded what you needed, pre-processed.

ELT loads raw data into the warehouse first, then transforms it using SQL (typically with a tool like dbt). This is the dominant modern pattern because cloud warehouses make storage cheap and compute scalable. Raw data is preserved, transformation logic is version-controlled, and you can re-transform historical data when business definitions change.

For most Australian organisations building on Snowflake, BigQuery, or Databricks today, ELT with dbt is the default recommendation. The exceptions:

  • Sensitive or regulated data that must be masked or de-identified before it enters the warehouse — in this case, some transformation or anonymisation at the ingestion layer is appropriate
  • High-volume streaming data where per-event transformation before loading reduces downstream processing cost
  • Legacy ETL tooling already in place with significant business logic embedded — the migration cost may not justify a full switch to ELT immediately

The practical implication for the five-warehouse consolidation: you'll likely ingest raw data from each source into a staging layer in your target platform, then apply a dbt transformation layer to produce clean, consistently-defined business entities — customers, orders, subscriptions, events — that all teams query against the same definitions.


Data Contracts: Making Pipelines Reliable at Scale

A data contract is a formal agreement between a data producer (a source system or team) and a data consumer (a pipeline, model, or analytics layer) that defines the expected schema, semantics, and quality guarantees of a data feed.

Without data contracts, pipelines break silently. A product team renames a column. An upstream API changes its response shape. A new market is added and a country code field starts returning values the pipeline wasn't built to handle. The pipeline keeps running, but the downstream data is wrong — and nobody knows until someone spots an anomaly in a dashboard.

Data contracts address this by making the interface between producer and consumer explicit and testable. In practice, a contract specifies:

  • Schema: field names, types, and nullability
  • Semantics: what a field means (e.g. created_at is UTC, not local time)
  • Quality rules: acceptable null rates, value ranges, referential integrity expectations
  • SLA: expected delivery frequency and latency

Tools like Soda, Great Expectations, and dbt tests are commonly used to encode and enforce these contracts. The organisational side — getting source teams to treat their data outputs as an API with consumers — is often harder than the tooling.

For data leaders managing a consolidation project, introducing data contracts at the ingestion boundary is a pragmatic first step: each source system's data is validated against its contract before it flows into the staging layer. This surfaces data quality issues early, where they're easier to fix, rather than downstream in models or ML features where the root cause is harder to trace.


Australian Data Sovereignty: What It Means for Your Platform Choice

Australian data sovereignty requirements aren't hypothetical. For organisations handling personal information under the Privacy Act 1988 and the Australian Privacy Principles (APPs), cross-border data transfer obligations are real and affect platform architecture decisions.

A data engineer viewed through a half-open glass meeting room doorway sits at a dual-monitor desk in a darkened Australian tech office, her face lit by screen glow, with a whiteboard behind her showing hand-drawn data architecture diagrams annotated with data residency notes.

Key considerations:

Data residency: All three major platforms — Snowflake, BigQuery, and Databricks — offer Australian regions (AWS Sydney/Melbourne, GCP Australia, Azure Australia East). Data can be kept onshore. This should be configured explicitly — it is not always the default, particularly for trial or developer accounts.

Cross-border transfers: If data flows to services outside Australia — for example, a US-based SaaS ingestion tool, or an ML API — you need to satisfy APP 8, which governs the disclosure of personal information to overseas recipients. This typically requires either recipient country adequacy assessment or explicit contractual protections.

Sector-specific requirements: Health data handled under the My Health Records Act 2012 and financial data subject to APRA prudential standards carry additional obligations. Fintech and healthtech organisations need to map their data flows carefully before selecting ingestion tools and cloud regions.

Practical architecture implication: For most Australian mid-market organisations, choosing an Australian cloud region for your primary warehouse, using Australian-region ingestion tooling where available, and documenting data flows for cross-border transfers is sufficient to satisfy baseline Privacy Act obligations. Organisations with regulated data (health, financial) should seek legal review of their data architecture before finalising platform selection.

This is an area where data infrastructure consulting with Australian market experience matters — not because the technology is different, but because the governance and compliance layer needs to be designed in, not bolted on.


What Good Data Infrastructure Consulting Looks Like in Practice

A data infrastructure engagement isn't a requirements-gathering exercise followed by a lengthy design document. It's an iterative build.

A typical structure for a consolidation project like the one described above:

Phase 1 — Assessment and platform selection (2–4 weeks): Audit existing sources, data volumes, query patterns, team capabilities, and compliance requirements. Select target platform and tooling. Define the data modelling standard (usually dimensional modelling or a variation) and agree on the transformation framework (usually dbt).

Phase 2 — Foundation build (4–8 weeks): Stand up the target platform. Build ingestion pipelines for priority sources using a tool like Fivetran, Airbyte, or custom connectors. Implement staging and base transformation layers in dbt. Add data contract tests at the ingestion boundary.

Phase 3 — Business layer and enablement (4–8 weeks): Build mart-layer models that represent agreed business entities. Migrate critical dashboards and reports. Train the internal team on the new stack. Document the data model.

The internal team should be operating the stack independently by the end of Phase 3. A good consulting engagement leaves you with ownership, not dependency.

This kind of structured data infrastructure work also directly enables downstream AI work. Once your data is clean, consistently modelled, and accessible through a single platform, the path to AI engineering — whether that's ML models, LLM-powered features, or AI agents — becomes significantly shorter. The same foundation that fixes your dashboards is the foundation your AI uses.

If your organisation is also thinking about the broader technology architecture — not just data but application and platform modernisation — it's worth understanding how application modernisation fits alongside a data infrastructure investment. Often the two are related: the legacy application is the source system that's hardest to ingest from.


Common Mistakes to Avoid

Choosing the platform before understanding the workload. Platform selection should follow from your actual query patterns, team skills, cloud commitments, and compliance requirements — not from a vendor demo.

Treating the warehouse as the destination, not the starting point. A clean warehouse enables analytics, but it also enables ML features, AI product capabilities, and real-time operational decisions. Design for the full use case.

Underestimating the data modelling work. The technology is the easy part. Agreeing on what a "customer" is across five source systems, and encoding that definition consistently, is the hard part. Budget time for it.

Skipping governance until "later." Access controls, audit logging, and data lineage are much cheaper to implement at build time than to retrofit. This is especially true for Australian organisations with Privacy Act obligations.

Building a team dependency on the consulting partner. A good engagement transfers knowledge and capability to your internal team. If the consulting partner is the only one who understands the architecture after six months, that's a problem.


Getting Started with Data Infrastructure Consulting in Australia

For most data leaders, the right starting point isn't a full platform build — it's a structured assessment that maps your current state, identifies the highest-priority gaps, and produces a phased roadmap with clear platform recommendations.

This typically takes two to four weeks and produces three things: a data architecture recommendation with rationale, a phased build plan tied to specific business outcomes (usually an AI use case or a critical analytics need), and an honest view of what your internal team can own versus where you need external support.

If you're exploring what a data infrastructure engagement might look like for your organisation — whether that's a consolidation project, a foundation build for AI, or a review of your current stack — we're happy to talk through it. No slide decks required.

For more on how data infrastructure connects to broader AI adoption, see our insights on AI readiness and AI engineering.

Share

Chris Kerr

Partner at Horizon Labs, an AI product consultancy and venture studio. A commercially focused product and technology leader with 20+ years building and scaling digital platforms, teams, and businesses across SaaS, travel, eCommerce, logistics and transport, and digital marketing — operating at the intersection of product, engineering, and data. Writes about platform strategy, AI transformation, modern data ecosystems, and the operational discipline that separates AI demos from AI products.