Data Engineering · 9 min read · Blog

The hidden cost of untyped pipelines

Schema drift is a tax you pay in 3am pages. An upstream team renames a column, a type silently widens from integer to float, a nullable field starts arriving null for real — and three dashboards downstream begin lying, confidently, for a week before anyone notices. No error fired. The pipeline ran green the whole time. That's the part that makes it expensive: untyped pipelines don't break loudly, they drift quietly.

Why untyped is the default

Nobody chooses untyped pipelines on purpose. They're what you get for free. You read a JSON blob, pluck the fields you need, write them somewhere. It's fast, it works, and on day one the data looks exactly like you expect. The cost is deferred, which is the most dangerous kind of cost — it accrues silently and comes due during an incident, when it's most expensive to pay.

The deferral is the trap. By the time drift bites, the person who wrote the pipeline has moved on, the assumptions live only in their head, and the only documentation of what the data was supposed to look like is the broken dashboard.

What you're actually paying for

  • Silent wrong answers. The worst outcome in data isn't downtime; it's a number that's confidently wrong and gets acted upon. Untyped pipelines specialize in producing exactly that.
  • Debugging time. Tracing a bad metric back through five untyped hops to the column that changed shape is hours of archaeology, usually under pressure.
  • Eroded trust. Each silent-drift incident teaches the business to distrust the numbers a little more. Trust is the entire product of a data team, and this is how it bleeds out.

Contracts turn drift into a caught error

A data contract is a simple promise made explicit: this dataset has these fields, these types, these constraints. Typed transformations carry that promise through each stage. The point isn't bureaucratic correctness — it's changing where the failure happens. Without a contract, drift surfaces as a wrong number in a board deck. With one, it surfaces as a failed check at the boundary, with a message that names the column and the violated expectation. Same drift; a loud, located error instead of a silent, smeared one.
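
To make that concrete, here is a minimal sketch of a contract check at a boundary, using only the Python standard library. Everything named here (Field, ContractViolation, check_record, ORDERS_CONTRACT and its columns) is hypothetical and invented for illustration; a real pipeline would more likely lean on a schema or validation library, but the idea is the same.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Field:
    """One promised column: its name, its exact type, and whether null is allowed."""
    name: str
    type_: type
    nullable: bool = False


class ContractViolation(Exception):
    """Raised when a record does not match the declared contract."""


def check_record(record: dict[str, Any], contract: list[Field]) -> None:
    """Raise ContractViolation naming the first column that breaks the contract."""
    for field in contract:
        if field.name not in record:
            raise ContractViolation(f"column '{field.name}': missing")
        value = record[field.name]
        if value is None:
            if not field.nullable:
                raise ContractViolation(
                    f"column '{field.name}': null, but declared non-nullable"
                )
            continue
        # Exact type match on purpose: isinstance() would let a bool pass as an int.
        if type(value) is not field.type_:
            raise ContractViolation(
                f"column '{field.name}': expected {field.type_.__name__}, "
                f"got {type(value).__name__}"
            )


# Hypothetical contract for an orders feed (column names are assumptions).
ORDERS_CONTRACT = [
    Field("order_id", int),
    Field("amount_cents", int),
    Field("coupon_code", str, nullable=True),
]

# The integer-to-float widening from the intro, caught at the boundary.
drifted = {"order_id": 1, "amount_cents": 499.0, "coupon_code": None}
try:
    check_record(drifted, ORDERS_CONTRACT)
except ContractViolation as err:
    print(err)  # column 'amount_cents': expected int, got float
```

The check is deliberately strict. Whether a widened type should fail the run or be coerced is a policy choice, but failing by default is what keeps the drift from passing silently.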

Where to add typing first

You don't need to type the whole graph at once — that is boiling the ocean. Add it where the leverage is highest:

  • At the seams between teams. The ingestion edge — where data crosses from a system you don't control into one you do — is where most drift originates and where a contract buys the most.
  • In front of anything load-bearing. The tables your headline metrics depend on. Type their inputs first.
  • At the grain boundary. Where rows get joined or aggregated, a type mismatch becomes a fan-out or a double-count. Guard those joins.
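
For the grain boundary in particular, a guard can be as small as two assertions run before the join. The sketch below assumes rows are plain dicts and that the right-hand table is meant to have one row per key; guard_join, JoinGuardError and the customer_id column are hypothetical names, not part of any library.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


class JoinGuardError(Exception):
    """Raised when a join would silently drop rows or fan out."""


def guard_join(left: list[dict[str, Any]], right: list[dict[str, Any]], key: str) -> None:
    """Check that the join key has one type on both sides and is unique on the right."""
    # Mixed key types (say int 42 on one side, str "42" on the other) never match,
    # so rows drop out of the join without any error.
    key_types = {type(row[key]).__name__ for row in left + right}
    if len(key_types) > 1:
        raise JoinGuardError(f"join key '{key}' has mixed types: {sorted(key_types)}")

    # Duplicate keys on the "one" side multiply rows on the other side,
    # which is where double-counts come from.
    counts = Counter(row[key] for row in right)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    if duplicates:
        raise JoinGuardError(
            f"join key '{key}' is not unique on the right side: {duplicates[:5]}"
        )


orders = [{"customer_id": 1, "amount_cents": 500}]
customers = [{"customer_id": 1, "region": "EU"}, {"customer_id": 1, "region": "US"}]
try:
    guard_join(orders, customers, "customer_id")
except JoinGuardError as err:
    print(err)
```

The uniqueness check only makes sense for joins that are supposed to be many-to-one. For a deliberate many-to-many join, the thing to assert would be the expected row count instead.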

Fail loud, then expand

Roll it out in two moves. First, make existing assumptions explicit and enforced — add checks that fail the run when the shape changes, so drift can't pass silently. Then expand coverage outward from the load-bearing core as incidents (or near-misses) show you where the next contract belongs.
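
The first move can start from the data you already have. One possible approach, sketched below with hypothetical helper names (infer_shape, diff_shape, enforce_shape, ShapeDrift), is to record the shape of a known-good batch as a baseline and raise when a later batch differs, so the run fails instead of finishing green.

```python
from __future__ import annotations

from typing import Any

Shape = dict[str, list[str]]


class ShapeDrift(RuntimeError):
    """Raised to fail the run when a batch no longer matches the baseline shape."""


def infer_shape(rows: list[dict[str, Any]]) -> Shape:
    """Map each column to the sorted list of type names observed in the batch."""
    seen: dict[str, set[str]] = {}
    for row in rows:
        for column, value in row.items():
            seen.setdefault(column, set()).add(type(value).__name__)
    return {column: sorted(types) for column, types in seen.items()}


def diff_shape(expected: Shape, observed: Shape) -> list[str]:
    """Describe every difference between two shapes, one message per column."""
    problems = []
    for column in sorted(expected.keys() - observed.keys()):
        problems.append(f"column '{column}' disappeared")
    for column in sorted(observed.keys() - expected.keys()):
        problems.append(f"unexpected new column '{column}'")
    for column in sorted(expected.keys() & observed.keys()):
        if expected[column] != observed[column]:
            problems.append(
                f"column '{column}' changed type: {expected[column]} -> {observed[column]}"
            )
    return problems


def enforce_shape(expected: Shape, rows: list[dict[str, Any]]) -> None:
    """Raise ShapeDrift if the batch differs from the baseline in any way."""
    problems = diff_shape(expected, infer_shape(rows))
    if problems:
        raise ShapeDrift("schema drift detected: " + "; ".join(problems))


# Baseline taken from a batch that is known to be good (sample data is invented).
baseline = infer_shape([{"user_id": 1, "score": 10}])

# An upstream rename, as in the intro: user_id became uid.
print(diff_shape(baseline, infer_shape([{"uid": 1, "score": 10}])))
```

An inferred baseline only captures what the data happened to look like, not what it should look like, so it works best as a stopgap until the load-bearing tables have hand-written contracts.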

Contracts feel like overhead right up until the first incident they convert from a week-long mystery into a five-minute fix. After that, they feel like the cheapest insurance you've ever bought — and they usually pay for themselves inside a quarter.

