Data Engineering Foundations Every ML Team Needs
Key takeaways
- Most ML teams stall because their data is unreliable, undocumented, or missing — not because their models are wrong.
- A production ML stack has five layers: ingestion, storage, transformation, serving, and observability. Skipping any of them creates debt that slows every future model.
- Start with a warehouse (Snowflake, BigQuery, or Redshift) unless you have a real lakehouse workload. Databricks is the default when you do.
- Pick Airflow for orchestration unless Dagster's asset model or Prefect's dynamic workflows fit your team better. Do not split across multiple tools.
- Data contracts and tests in dbt plus Great Expectations or Soda are the cheapest insurance you can buy against silent model failure.
Why ML teams stall without data foundations
Most ML teams do not stall because their algorithms are weak. They stall because they cannot get clean, fresh, consistent data into the shape a model needs, at the time it needs it. The symptoms are predictable: a data scientist spends three weeks "just exploring the data" on a task that was scoped for three days. A model that passed evaluation fails in production because the feature computed offline does not match the feature computed online. A dashboard starts showing nulls on a Monday and nobody notices until Thursday.
Every one of these stories is a data engineering problem wearing ML clothing. The fix is almost never a better model; it is better pipelines, better contracts, and better observability on the data itself. The ML team that ignores this builds on sand. The one that invests in it builds a platform that makes the tenth model cheaper than the first.
The order of operations matters. We have seen teams buy a feature store before they had reliable daily pipelines. We have seen teams adopt a lakehouse before they could explain where their source-of-truth customer table lived. These are expensive inversions. Foundations go in first for a reason.
The five-layer ML data stack
A production data stack for ML has five layers. Each one does a specific job. Each one can be simple or complex depending on scale, but none of them is optional.
1. Ingestion
Ingestion is how data arrives from source systems — application databases, event streams, SaaS APIs, file drops. The goals are reliability, idempotency, and appropriate freshness. Common choices: Fivetran, Airbyte, or Stitch for SaaS sources; Debezium or another change-data-capture (CDC) tool for operational databases; Kafka or managed equivalents for streams. For most mid-market teams, Fivetran plus a CDC tool covers 90% of the surface area with predictable cost.
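Idempotency is the property worth illustrating: a retried load must never duplicate rows. One common pattern is to upsert on a stable business key. A minimal sketch, using in-memory SQLite as a stand-in for the warehouse (the table and column names are invented for illustration):

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; a real pipeline would target
# Snowflake/BigQuery, but the idempotency pattern is the same: upsert on a
# stable business key so re-running a load never duplicates rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_orders (
        order_id   TEXT PRIMARY KEY,   -- stable business key
        amount     REAL,
        updated_at TEXT
    )
""")

def load_batch(rows):
    """Idempotent load: safe to retry the same batch after a failure."""
    conn.executemany(
        """
        INSERT INTO raw_orders (order_id, amount, updated_at)
        VALUES (:order_id, :amount, :updated_at)
        ON CONFLICT(order_id) DO UPDATE SET
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        rows,
    )
    conn.commit()

batch = [
    {"order_id": "o-1", "amount": 10.0, "updated_at": "2026-01-05"},
    {"order_id": "o-2", "amount": 25.0, "updated_at": "2026-01-05"},
]
load_batch(batch)
load_batch(batch)  # replaying the same batch leaves the table unchanged

count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(count)  # 2 rows, not 4
```

The same merge-on-key logic works in warehouse SQL (`MERGE` in Snowflake and BigQuery); what matters is that every loader can be re-run without manual cleanup.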
2. Storage
Storage is where data lives after it arrives. This is almost always a warehouse (Snowflake, BigQuery, Redshift) or a lakehouse (Databricks, or an open combination like Iceberg plus Trino). The choice matters for cost, flexibility, and which teams you can hire. We cover the decision below.
3. Transformation
Transformation is where raw data becomes the clean, modelled tables that downstream analytics and ML consume. The clear winner in 2026 is dbt. A well-organized dbt project with staging, intermediate, and mart layers, plus a consistent naming convention, buys ML teams more velocity than any other single investment. If you are writing transformations as raw SQL scripts, migrate to dbt before doing anything else on this list.
4. Serving
Serving is how cleaned data reaches the places that use it: BI tools, notebooks, ML training jobs, and — when latency matters — online inference endpoints. For batch ML, the warehouse is usually sufficient. For real-time inference, you add a fast key-value store (Redis, DynamoDB, Bigtable) or a feature store that handles the online-offline split for you.
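The online side of serving reduces to a single low-latency point lookup per request. A minimal sketch of the pattern, with a plain dict standing in for Redis or DynamoDB and an assumed 4-hour freshness TTL (both the store and the feature names are illustrative):

```python
import time

# A plain dict stands in for Redis/DynamoDB; the pattern is what matters:
# batch jobs push precomputed features keyed by entity id, and the inference
# service does one low-latency point lookup per request.
online_store = {}  # entity_id -> (features, written_at)

FRESHNESS_TTL_S = 4 * 3600  # reject features older than the (assumed) 4h SLA

def publish_features(entity_id, features):
    """Called by the batch pipeline after the mart tables are rebuilt."""
    online_store[entity_id] = (features, time.time())

def get_features(entity_id, now=None):
    """Called on the inference hot path; returns None on a miss or stale data."""
    now = time.time() if now is None else now
    entry = online_store.get(entity_id)
    if entry is None:
        return None
    features, written_at = entry
    if now - written_at > FRESHNESS_TTL_S:
        return None  # stale features are worse than a fallback default
    return features

publish_features("user-42", {"orders_30d": 7, "avg_basket": 31.5})
print(get_features("user-42"))  # fresh -> the features dict
print(get_features("user-99"))  # miss  -> None
```

A feature store productizes exactly this write path and read path, plus the offline mirror of the same values for training.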
5. Observability
Observability is how you know the data is correct right now, not yesterday. This layer includes tests (dbt tests, Great Expectations, Soda), freshness checks, row count anomaly detection, schema drift alerts, and lineage. Tools like Monte Carlo, Metaplane, and Elementary do this well; so does a disciplined dbt project with aggressive test coverage. What matters is that a data incident is detected by the system, not by the data scientist debugging a model at 11 p.m.
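Two of these checks are simple enough to sketch directly. Assuming the metadata (last load time, recent daily row counts) is already being collected, a freshness check and a crude row-count anomaly check look like this — the 4-hour SLA and 3-sigma tolerance are illustrative defaults, not a standard:

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded_at, now, sla=timedelta(hours=4)):
    """Freshness check: is the table within its SLA?"""
    return (now - last_loaded_at) <= sla

def check_row_count(history, today, tolerance=3.0):
    """Crude anomaly check: flag row counts more than `tolerance`
    standard deviations away from the recent mean."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = var ** 0.5 or 1.0  # avoid divide-by-zero on a perfectly flat history
    return abs(today - mean) <= tolerance * std

now = datetime(2026, 1, 5, 12, 0)
print(check_freshness(datetime(2026, 1, 5, 9, 0), now))  # True: 3h old, SLA is 4h
print(check_row_count([1000, 1020, 990, 1010], 40))      # False: volume collapsed
```

Dedicated tools add history storage, seasonality-aware thresholds, lineage-aware alert routing, and incident workflow on top of checks like these; the underlying assertions are this simple.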
Skip any layer and the others compensate badly. Skip observability and you accumulate silent data bugs. Skip transformation and every model builds its own truth. Skip ingestion reliability and every downstream system inherits the instability.
Choosing a warehouse or lakehouse
The warehouse-versus-lakehouse question has become less sharp over the last two years as the platforms have converged, but the practical decision is still meaningful. The right answer depends on workload shape, existing skills, and where the team wants to be in three years.
| Platform | Best fit | ML strengths | Watch out for |
|---|---|---|---|
| Snowflake | Structured, BI-heavy workloads with ML adjacency | Excellent SQL, Snowpark for Python, Cortex for built-in ML | Storage plus compute bills can surprise teams without governance |
| BigQuery | Teams already on GCP, large exploratory workloads | Native BigQuery ML, Vertex integration, good serverless economics at small scale | Slot contention under concurrent workloads, costs scale with scanned data |
| Databricks | Mixed structured and unstructured, heavy ML and streaming | Unified Spark for training and ETL, MLflow, Delta Lake, strong for deep learning | Steeper learning curve, more operational surface area than a pure warehouse |
| Redshift | Teams on AWS with existing Redshift investments | Serverless variants, Redshift ML, tight integration with SageMaker | Historically slower iteration than Snowflake, governance is work-in-progress |
Our default recommendation for mid-market teams starting fresh: Snowflake or BigQuery, depending on cloud preference. Add Databricks only when you have a genuine lakehouse workload — large-scale feature engineering on unstructured data, heavy Spark usage, or a serious deep learning roadmap. Running two platforms side by side is expensive and confusing; do it only when the workloads cannot coexist in one.
Orchestration — Airflow, Dagster, Prefect
Orchestration is the control plane that schedules and coordinates jobs across the other layers. Three tools dominate the modern stack, and the choice is less about capability than about team shape.
Airflow
The safe enterprise default. The ecosystem is enormous; every vendor has an Airflow operator; every senior data engineer has shipped Airflow before. Pick Airflow when you value ecosystem breadth, hiring predictability, and battle-tested reliability. The trade-off is that pipeline definitions are DAG-first, not data-first, so lineage and typed assets require additional work.
Dagster
Built around data assets as first-class citizens. Pipelines are defined by what they produce, not by task graphs, which changes the mental model in useful ways: you can ask "what is the state of this table?" directly. Pick Dagster when your team values strong typing, clean lineage, and asset-based thinking. The ecosystem is smaller than Airflow's but mature enough for most stacks.
Prefect
Pythonic, flexible, and strong at dynamic workflows where the DAG shape is not known at definition time. Pick Prefect for event-driven or ML-heavy workflows where Airflow's static DAG model gets in the way. The trade-off is a smaller enterprise footprint and fewer senior engineers who have run it at scale.
Decision heuristic
If you are starting fresh and the team is not opinionated, pick Airflow (managed: MWAA, Cloud Composer, or Astronomer). If you are on Dagster or Prefect already and happy, stay. Do not run two orchestrators in parallel — it is always more expensive than it looks and always ends with a consolidation project nobody wanted.
Data quality contracts and testing
The single biggest source of production ML incidents is silent data quality drift: a field that used to be populated now has 5% nulls; a categorical that used to have four values now has six; a schema change on Tuesday breaks the downstream model on Friday. Data contracts and tests exist to make these incidents loud and fast.
What a data contract contains
A contract is an explicit, versioned agreement between the producer of a dataset and its consumers. At minimum it specifies:
- Schema (columns, types, nullability)
- Semantics (what each column means, units, allowed values)
- Freshness SLA (how old is "too old")
- Ownership (which team, which on-call, which escalation path)
- Change policy (what requires a version bump, what is backward compatible)
Contracts live in code — typically YAML or JSON Schema in the same repo as the producing pipeline — and are tested in CI. A change that breaks the contract breaks the build. This sounds like overhead; it is actually insurance. The alternative is learning about the break from the consumer, usually via a broken model.
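To make the idea concrete, here is a hedged sketch of a contract for a hypothetical `orders` table, expressed as plain data so it can live in the producer's repo, plus the kind of CI check that fails the build on a breaking change. Every field name, the owner address, and the SLA are invented for illustration:

```python
# Hypothetical contract for an `orders` table. In practice this would be
# YAML or JSON Schema checked into the producing pipeline's repo.
ORDERS_CONTRACT = {
    "version": "1.2.0",
    "owner": "payments-data@example.com",
    "freshness_sla_hours": 4,
    "columns": {
        "order_id": {"type": "string", "nullable": False},
        "amount":   {"type": "float",  "nullable": False},
        "status":   {"type": "string", "nullable": False,
                     "allowed": ["placed", "shipped", "cancelled"]},
    },
}

def breaking_changes(old, new):
    """Return violations that require a major version bump: removed
    columns, type changes, or loosened nullability."""
    problems = []
    for name, spec in old["columns"].items():
        if name not in new["columns"]:
            problems.append(f"column removed: {name}")
            continue
        new_spec = new["columns"][name]
        if new_spec["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
        if new_spec["nullable"] and not spec["nullable"]:
            problems.append(f"nullability loosened: {name}")
    return problems

# Simulate a producer proposing to drop a column consumers depend on.
proposed = {**ORDERS_CONTRACT, "columns": dict(ORDERS_CONTRACT["columns"])}
del proposed["columns"]["status"]
print(breaking_changes(ORDERS_CONTRACT, proposed))  # ['column removed: status']
```

In CI, a non-empty result fails the build, so the producer learns about the break before the consumer's model does.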
The test stack
Three tools cover most of what you need, and they compose well.
- dbt tests. Cover structural invariants: not-null, uniqueness, referential integrity, accepted values, custom SQL assertions. Run on every model build. Failures block downstream materialization when configured correctly.
- Great Expectations. Cover semantic and statistical invariants: row counts within expected ranges, distribution checks, column value ranges, correlation sanity checks. Run on a schedule, independent of the main pipeline, so they catch drift that dbt tests miss.
- Soda. Cover cross-system freshness and SLA checks: this table should be updated within 4 hours of the source; this API should match this warehouse column. Complements dbt and Great Expectations rather than replacing them.
Start with dbt tests on every mart-layer table. Add Great Expectations for the three to five tables that ML models depend on. Add Soda when cross-system freshness becomes the pain point. Do not try to boil the ocean on day one. Tests accumulate value the way any quality investment does: slowly at first, then rapidly.
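For intuition, the semantic checks above reduce to small, explicit assertions. This plain-Python sketch shows what the checks assert — it is deliberately not the Great Expectations API, and `rows` stands in for a mart-layer table with invented columns:

```python
# Plain-Python sketch of the semantic checks described above — the kinds of
# assertions Great Expectations formalizes. Not the GE API; illustration only.
rows = [
    {"user_id": 1, "country": "DE", "ltv": 120.0},
    {"user_id": 2, "country": "FR", "ltv": 80.0},
    {"user_id": 3, "country": "DE", "ltv": None},
]

def null_rate(rows, col):
    return sum(1 for r in rows if r[col] is None) / len(rows)

def values_in_set(rows, col, allowed):
    return all(r[col] in allowed for r in rows)

def row_count_between(rows, lo, hi):
    return lo <= len(rows) <= hi

results = {
    "ltv_null_rate_under_5pct": null_rate(rows, "ltv") <= 0.05,
    "country_in_allowed_set":   values_in_set(rows, "country", {"DE", "FR", "US"}),
    "row_count_in_range":       row_count_between(rows, 1, 10_000),
}
print(results)
# A real deployment alerts on any False and blocks downstream jobs
# for critical tables.
```

Here the null-rate check fails (one null in three rows is well above 5%) — exactly the kind of silent drift that would otherwise surface as a degraded model weeks later.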
Feature stores — when you actually need one
Feature stores (Feast, Tecton, Databricks Feature Store, Hopsworks) solve three specific problems: point-in-time correctness, online-offline consistency, and feature reuse across multiple models. Those are real problems — but they are not everyone's problems.
Consider a feature store when at least two of these apply:
- You serve real-time predictions with tight latency budgets (typically below 100 ms)
- Multiple production models share overlapping features, and recomputing them is expensive or inconsistent
- You need strict point-in-time correctness during training to avoid label leakage in temporal data
If none of these apply, a disciplined dbt project plus a thin caching layer gets you further than a feature store. Buying complexity you cannot justify is expensive; the feature store sits in the critical path of every model and becomes its own reliability problem.
When you do need one, Feast is a reasonable open-source starting point; Tecton and Databricks Feature Store are strong managed options. Whichever you pick, integrate it with your orchestration and observability layers from day one so features are not a black box.
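Point-in-time correctness is the least intuitive of the three problems, so here is a minimal sketch of it: for each training label, use only the latest feature value computed strictly before the label's timestamp, never a later one (which would leak the future into training). Timestamps are plain integers (days) for brevity:

```python
import bisect

# Feature history for one entity: (as_of_day, value), sorted ascending.
# A feature store does this join at scale across all entities and features.
feature_history = [
    (1, 0.10),
    (5, 0.40),
    (9, 0.90),
]

def point_in_time_value(history, label_day):
    """Latest feature value with as_of_day strictly before label_day, else None."""
    days = [day for day, _ in history]
    i = bisect.bisect_left(days, label_day)  # first entry >= label_day
    if i == 0:
        return None  # no feature value known before this label
    return history[i - 1][1]

print(point_in_time_value(feature_history, 6))  # 0.40 — day 9's value is the future
print(point_in_time_value(feature_history, 1))  # None — nothing strictly earlier
```

Doing this join wrong is invisible in offline evaluation (the model looks great, because it peeked) and only shows up as a production gap — which is why teams with temporal data weight this criterion heavily.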
A reference architecture
Here is the stack we recommend most often for a mid-market team starting fresh or consolidating. It is deliberately boring.
Sources            Ingestion          Storage             Transformation            Serving / ML
---------------    --------------     ----------------    ---------------------     --------------------------
App databases -->  Fivetran       --> Snowflake       --> dbt (staging / marts) --> BI: Looker / Metabase
Event streams -->  Kafka + CDC    --> (or BigQuery,       + dbt tests               Notebooks: Hex
SaaS APIs     -->  Fivetran       -->  Databricks)        + Great Expectations      Batch ML: Airflow + MLflow
                                                                                    Online: Feature store

Orchestration:      Airflow (managed)
Observability:      Monte Carlo + dbt freshness + Soda checks
Lineage + catalog:  dbt docs / DataHub
Read left to right: data enters, lands in a single storage layer, is modelled into trusted marts, then served to analytics and ML consumers. Orchestration sits across the top. Observability sits across the bottom. Every arrow has a contract, a test, and an owner.
Nothing here is exotic. That is the point. Exotic architectures are how ML teams collect technical debt; boring architectures are how they ship. When teams ask us for an opinion on what data engineering and MLOps should look like at their scale, this is usually the diagram we start from.
If you want to go deeper on adjacent topics, related reading on MLOps from notebook to production covers what happens to the model after the data is right, and measuring AI ROI and business impact covers how to tie the whole stack back to numbers the CFO cares about. For teams whose next step is agents rather than models, building AI agents that work in production picks up where this post ends.
Frequently asked questions
Do we need a data warehouse or a lakehouse for ML?
If your workloads are mostly structured and BI-heavy, a warehouse like Snowflake or BigQuery is simpler and faster to adopt. If you have significant unstructured or semi-structured data (logs, images, documents) and want ML and analytics on the same storage, a lakehouse like Databricks makes more sense. Most mid-market ML teams start with a warehouse and add lakehouse components later.
When do we actually need a feature store?
You need a feature store when at least two of the following are true: you serve real-time predictions at low latency, multiple models share overlapping feature sets, or you need point-in-time correctness for training to avoid label leakage. If none of those apply, a well-organized dbt project and a caching layer will get you further than buying a feature store you cannot justify.
Should we pick Airflow, Dagster, or Prefect?
Airflow is the safe enterprise default with the widest ecosystem. Dagster is a better fit when data assets (tables, features, models) are first-class citizens and the team values lineage and typed pipelines. Prefect shines for dynamic workflows and teams that prefer a Pythonic developer experience. Pick Airflow unless you have a specific reason to pick the others.
What is a data contract?
A data contract is an explicit, versioned agreement between the team producing data and the teams consuming it: schema, semantics, freshness SLA, ownership, and allowed change patterns. It lives in code, is tested in CI, and breaks a build when violated. Contracts are the single most effective tool for eliminating the silent data quality incidents that kill ML models in production.
How do we test data quality?
Run structural tests (not-null, uniqueness, referential integrity, accepted values) on every pipeline run with dbt tests. Add semantic tests (row counts within expected ranges, distribution checks, freshness) with Great Expectations or Soda. Alert on violations and block downstream jobs for critical failures. Tests should be owned by the data producer, not the consumer.
How big does an ML team need to be to invest in data engineering?
Any ML team running more than two models in production benefits from dedicated data engineering investment. Below that threshold, the overhead of a mature platform is hard to justify. Beyond it — roughly three production models or three data scientists — the lack of data engineering becomes the single biggest blocker on velocity and reliability.
Planning AI work this quarter?
Book a 30-minute strategy call and we'll stress-test your use case before you commit.