← All writing
Star Schema6 min read · Blog

Use surrogate keys, not the source system's IDs

When you wire a fact table to its dimensions, you choose what key to join on. The obvious choice is the natural key — the customer ID from the CRM, the SKU from the catalog. It's right there in the source; why invent anything? Because that one shortcut quietly forecloses some of the most important things a dimensional model is supposed to do. The alternative — a surrogate key, a meaningless integer the warehouse assigns — looks like pointless indirection until you see what it buys.

A surrogate key is a warehouse-owned identity

A surrogate key is a column that exists only in your model: an auto-assigned integer with no business meaning, the primary key of the dimension, referenced by the fact. The natural key still lives in the dimension as an attribute — you don't throw it away — but the join happens on the surrogate. That one layer of indirection is the whole trick.

What it buys you

  • History tracking (the big one). Slowly changing dimensions of Type 2 require multiple rows for the same real-world entity — Texas customer, then New York customer. Each version needs its own key so a fact can point at the version that was true at the time. Natural keys can't do this; there's only one customer ID, so there's nowhere to hang the history. Surrogate keys make point-in-time truth possible.
  • Decoupling from the source. Re-platform the CRM and every natural key changes. If facts joined on those, your whole history breaks. Joined on surrogates, the warehouse is insulated — you remap once in the dimension and the billions of fact rows never notice.
  • Performance. A compact integer key joins and indexes faster than a long string or a composite natural key, and on large fact tables that adds up.
  • Handling the unknown. Late-arriving facts, nulls, and "not yet known" members get clean surrogate rows (a "-1 / Unknown" member) instead of null joins that silently drop rows from aggregates.

The cost is real but small

Surrogate keys aren't free: you maintain the key assignment, and during loads you look up the right surrogate for each incoming natural key (a step that gets interesting with Type 2 history and late arrivals). That's genuine ETL work. But it's bounded, well-understood, and it happens once in the pipeline — versus natural keys, whose costs show up unbounded and late, in the form of history you can't reconstruct and a re-platforming that takes the warehouse down with it.

The rule

Facts reference dimensions by surrogate key. Keep the natural key in the dimension as an attribute for traceability, but never join on it. It feels like ceremony on day one. It's the thing that lets the model track history, survive a source migration, and stay fast — which is most of why you built a star in the first place.

Also asListensoonSlidessoonPodcastsoonVideosoon

Have data that should be doing more?

Tell me about the pipeline that breaks, the metric nobody trusts, or the analysis stuck in a notebook. Let's operationalize it.