← All writing
Data Modeling6 min read · Blog

Declare the grain before you write a line of SQL

There's one sentence that decides whether a fact table will be trustworthy or cursed, and most teams never say it out loud: "One row in this table is exactly one ______."

That blank is the grain. It's the most important decision in the model, it takes five seconds to state, and skipping it is the root cause of more wrong numbers than any bug I've ever chased.

What grain means

The grain is the level of detail a fact table records — what a single row represents. "One row per order." "One row per order line." "One row per shipment per day." Each is a different grain, and they are not interchangeable. The moment you can't finish the sentence cleanly, you don't have a model. You have a pile of rows that look like data.

Declaring the grain first is backwards from how most people work. The instinct is to start joining tables and see what comes out. But the grain determines which joins are safe, which measures are additive, and what a SUM even means. Decide it last and you're reverse-engineering meaning from a query result — which is how fuzzy grain sneaks in.

The symptom: numbers that don't tie

Fuzzy grain announces itself the same way every time. Revenue is suddenly 1.8× too high. A count of customers exceeds the number of customers. A total changes depending on which filter you apply first.

The usual culprit is a fan-out join: you join an order-grain table to an order-line-grain table, every order row multiplies into one row per line, and now your order total is double-counted across its own line items. Nobody wrote a bug. The grain was just never pinned down, so two grains got multiplied together and the measures inflated.

Mixed grain is the silent killer

The dangerous version isn't an obviously broken table — it's a table where the grain is mostly consistent. One row per order... except refunds, which are one row per line. Or daily snapshots... except the first day, which is cumulative. These tables pass a glance and fail under aggregation, and because they're right 90% of the time, the wrong 10% gets trusted.

The fix is discipline, not cleverness:

  • State the grain in the table's description. One sentence, no "usually."
  • Keep one grain per fact table. Need order-level and line-level? That's two tables, joined on a key — not one table doing both.
  • Check additivity against the grain. A measure that's safe to SUM at one grain may only be averageable at another. Know which before you put it on a dashboard.

Say the sentence

Before you write SQL, finish the sentence: one row is exactly one ____. If you can, the joins, the measures, and the metrics all fall out of it cleanly. If you can't, stop — you've found the real work, and it's cheaper to do now than to explain a wrong number to a VP later.

Grain decides what a row means. The next decision — how attributes change over time — decides whether your history stays honest.

Also asListensoonSlidessoonPodcastsoonVideosoon

Have data that should be doing more?

Tell me about the pipeline that breaks, the metric nobody trusts, or the analysis stuck in a notebook. Let's operationalize it.