The semantic layer is the missing piece for trustworthy AI analysis
If you've watched an AI agent write SQL against a production warehouse, you've
seen the failure mode: it's fluent, not correct. It joins orders to
users on the wrong key. It sums amount without noticing half the rows are
refunds. It picks created_at when the business runs on closed_at. Every
answer arrives with the same calm confidence — including the wrong ones.
This isn't a model problem you fix with a bigger model. It's a grounding problem, and the fix is a semantic layer.
Why raw schemas break LLMs
A warehouse schema encodes almost none of what a human analyst knows. The
table is called fct_txn; the analyst knows it's deduplicated nightly, that
status = 'C' means complete, that revenue excludes tax and refunds, and that
"customer" means the billing account, not the user. None of that lives in the
DDL. The agent can't infer it — so it guesses, plausibly.
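To make that gap concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name fct_txn and the status = 'C' convention come from the example above; the columns amount_cents, tax_cents, and is_refund are hypothetical, and treating "excludes refunds" as "filter out refund rows" is an assumption. Both queries run without error. Only one of them encodes what the analyst knows.

```python
import sqlite3

# In-memory stand-in for a warehouse table. Column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_txn ("
    "txn_id INTEGER, status TEXT, amount_cents INTEGER, "
    "tax_cents INTEGER, is_refund INTEGER)"
)
rows = [
    (1, "C", 10000, 800, 0),  # completed sale
    (2, "C", 5000, 400, 0),   # completed sale
    (3, "P", 7000, 560, 0),   # pending, not yet revenue
    (4, "C", 10000, 800, 1),  # refund row, stored with a positive amount
]
conn.executemany("INSERT INTO fct_txn VALUES (?, ?, ?, ?, ?)", rows)

# What an agent plausibly writes from the DDL alone: syntactically fine.
naive = conn.execute("SELECT SUM(amount_cents) FROM fct_txn").fetchone()[0]

# What the analyst means by "revenue": completed only, net of tax, no refunds.
governed = conn.execute(
    "SELECT SUM(amount_cents - tax_cents) FROM fct_txn "
    "WHERE status = 'C' AND is_refund = 0"
).fetchone()[0]

print(naive, governed)
```

On this toy data the naive total is more than double the governed one, and nothing in the schema would tell the agent which number to trust.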
Give a capable LLM a clean, well-described data model and the same agent gets sharp. Give it raw tables and column names from 2019, and you've built a very fast way to be wrong.
What the semantic layer actually provides
A semantic layer sits between the physical tables and anyone — human or agent — asking questions. It defines, in one governed place:
- Entities and grain: what a "customer," "order," or "session" is, and the level each table lives at.
- Metrics: net_revenue defined once, with its filters and exclusions, so it can't be re-derived three different ways.
- Relationships: the join paths that are valid, so the agent can't invent one.
- Descriptions: the human context ("excludes internal test accounts") that the schema omits.
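As a rough illustration of those four pieces, here is a sketch in plain Python dataclasses. It is not the format of any particular semantic-layer tool. Every class name, table name (apart from fct_txn), and column name here is hypothetical, chosen only to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Entity:
    name: str
    table: str
    grain: str        # what one row of the table represents
    description: str  # human context the DDL does not carry


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str           # the single agreed-upon aggregate
    filters: tuple[str, ...]  # exclusions baked into the definition
    description: str


@dataclass(frozen=True)
class Relationship:
    left: str   # entity name
    right: str  # entity name
    on: str     # the only join condition considered valid


@dataclass
class SemanticModel:
    entities: dict[str, Entity] = field(default_factory=dict)
    metrics: dict[str, Metric] = field(default_factory=dict)
    relationships: list[Relationship] = field(default_factory=list)

    def join_path(self, left: str, right: str) -> Optional[Relationship]:
        """Return the declared join between two entities, or None."""
        for rel in self.relationships:
            if {rel.left, rel.right} == {left, right}:
                return rel
        return None


model = SemanticModel(
    entities={
        "customer": Entity(
            "customer", "dim_billing_account", "one row per billing account",
            "The billing account, not the user. Excludes internal test accounts.",
        ),
        "order": Entity(
            "order", "fct_txn", "one row per transaction",
            "Deduplicated nightly.",
        ),
    },
    metrics={
        "net_revenue": Metric(
            "net_revenue", "SUM(amount - tax)",
            ("status = 'C'", "is_refund = 0"),
            "Completed transactions only, excluding tax and refunds.",
        ),
    },
    relationships=[
        Relationship("order", "customer",
                     "fct_txn.billing_account_id = dim_billing_account.id"),
    ],
)
```

The useful property is the absence of things: model.join_path("customer", "session") returns None, so there is no undeclared join for an agent to stumble into.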
For an agent, that's not documentation — it's the action space. It stops the model from writing arbitrary SQL and lets it compose trusted, pre-defined building blocks. The agent reasons about which metric and which dimensions; it no longer gets to reinvent what "revenue" means at 2am.
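One way to picture "action space" is a tiny compiler: the agent picks a metric and some dimensions by name, and a function the data team owns turns that into SQL. The sketch below is an assumption-heavy toy, not how any specific product works. The METRICS and DIMENSIONS registries, the compile_query function, and the column names are all hypothetical.

```python
# Hypothetical registries. In practice these would be generated from the
# governed model rather than hand-written next to the compiler.
METRICS = {
    "net_revenue": {
        "sql": "SUM(amount - tax)",
        "table": "fct_txn",
        "filters": ["status = 'C'", "is_refund = 0"],
    },
}
DIMENSIONS = {
    "billing_account": "billing_account_id",
    "closed_month": "strftime('%Y-%m', closed_at)",
}


def compile_query(metric: str, dimensions: list[str]) -> str:
    """Build SQL from named building blocks; reject anything undefined."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    unknown = [d for d in dimensions if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")

    m = METRICS[metric]
    select = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    select.append(f"{m['sql']} AS {metric}")

    sql = f"SELECT {', '.join(select)} FROM {m['table']}"
    if m["filters"]:
        sql += " WHERE " + " AND ".join(m["filters"])
    if dimensions:
        # Group by the positions of the dimension columns in the SELECT list.
        sql += " GROUP BY " + ", ".join(
            str(i + 1) for i in range(len(dimensions))
        )
    return sql


print(compile_query("net_revenue", ["billing_account"]))
```

The agent's output shrinks from free-form SQL to a pair like ("net_revenue", ["billing_account"]). A request for a metric nobody defined raises an error instead of producing a confident guess.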
Grounding beats cleverness
This is the through-line of everything I build: the model is downstream of the modeling. A mediocre LLM on a great semantic layer will out-analyze a frontier model on a pile of raw tables, every time — because the hard part of analysis was never the SQL syntax. It was knowing which question maps to which trustworthy number.
The semantic layer is how you encode that knowledge once and let everything — dashboards, notebooks, and now agents — inherit it. It's also the highest-ROI "AI project" most teams could run, and it has nothing to do with AI. It's data modeling. The agents just made it urgent.
Next in the series: how to actually expose that model to an agent so it can navigate it.