Foundations Before Agents: The Unsexy Work That Decides Whether AI Works

Almost every failed AI project I've been called in to look at has the same shape. The agent works in the demo. It works on the founder's laptop. It works on the three cherry-picked examples in the pitch deck. Then it goes live, touches the real business, and starts quietly producing answers that are subtly — or spectacularly — wrong.

Nobody wants to hear that the problem isn't the model. The model is the exciting bit. The problem is almost always what sits underneath it or is wrapped around it: the data the agent is reading, the systems it's reaching into, and the scaffolding nobody built to run it safely and keep its answers relevant.

In the pillar piece on the three fates of AI adoption, I described the "surfing" business as one that rebuilt its foundations before the agents went live. This is the post about what that actually means — written for the business owner who keeps hearing the word "foundations" and isn't sure what's in scope.

The same supplier, the same quote — and what's actually broken

Take the B2B supplier from the pillar piece. An enquiry comes in. A rep needs to produce a quote. To do that, three things have to be true:

There is one answer to what this product is and what it costs.
There is a real-time answer to whether it's in stock.
There is one version of who this customer is and what they've bought before.

In most real-world, messy businesses, the so-called source of truth is scattered across overlapping copies. There are three product lists with overlapping SKUs. The stock figure in the ERP lags the warehouse by a day. The customer exists three times in the CRM under slightly different names. A human rep handles this by knowing which version to trust, because they've worked there for eight years.

The moment you put an AI agent in that rep's seat, the workaround stops working. The agent doesn't know which file is authoritative. It will confidently pick one, quote off it, and send the email. And because the output looks polished, nobody catches the error until the customer does.

This is the foundation problem in one sentence: AI removes the human who was silently absorbing your data quality issues.

The foundation isn't one thing — it's three layers

When people say "foundations," they usually mean "clean up the data." That's one layer, and it's the one everyone fixates on. But a foundation that an agent can actually run on top of has three distinct layers, and most stuck projects are missing the second or third entirely — which is exactly why they never escape the testing phase.

Layer 1: Data — what the agent reads from

The raw material. One source of truth for the entities your business runs on, records that have been de-duplicated and reconciled, documents the agent can actually search. If this layer is wrong, the agent is confidently wrong, because it's reasoning over bad inputs.

This is the layer everyone understands, so it's the one that gets the budget. It's necessary. It is nowhere near sufficient.

Layer 2: Systems — what the agent acts through

Data sitting in a clean database doesn't do anything. The work happens in your ERP, your CRM, your warehouse system, your inbox — and those systems change minute to minute. This layer is the live, governed connection between the agent and the operational tools where work actually flows: the ability to read current stock, write a quote, update an order, with the same access controls a human user would have.

An agent that can reason but can't do anything is a chatbot. An agent that can do everything because nobody scoped its access is a liability. Layer 2 is getting this boundary right.

Layer 3: The harness — what makes the agent safe to run

This is the layer almost nobody plans for, and it's the one that decides whether a working demo ever becomes a working system. The harness is the scaffolding around the agent:

Orchestration — managing multi-step tasks, retries, and hand-offs, so the agent runs reliably and not just once in a notebook.
Observability — logging what the agent was asked, what data it pulled, and what it produced, so you can debug a bad output instead of guessing.
Evaluation — a way to measure quality before and after every change, so you can actually exit the testing phase with confidence rather than hoping.
Guardrails and human-in-the-loop — limits, escalation paths, and checkpoints, so a wrong answer gets caught before it reaches a customer.

Without this layer, you have a clever prototype that no one trusts with anything that matters. That is the precise description of an AI project stuck in "testing."

The agent sits on top of all three. It is the last thing you build, not the first. A business that gets this right has done the unglamorous work across all three layers before a single agent goes live — and the next section is what each layer looks like in practice.

What "the foundation" actually consists of

The three layers above are the mental model. Here's what they look like in practice — seven concrete things, grouped by the layer they belong to. None of them are glamorous. All of them are load-bearing.

Data layer — what the agent reads from

1. A single source of truth for the entities your business runs on. Products, customers, prices, locations, employees. Pick the handful of entities your workflows actually depend on. For each one, there needs to be one system that owns the answer, and every other system either reads from it or is reconciled against it. This is called master data management, and it has been unfashionable for twenty years. AI just made it urgent again.

The test: if you ask three people in your business "what's the price of product X for customer Y?", do you get one answer or three?
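The same test can be run against systems instead of people. Here is a minimal sketch, assuming each system can export its price for a product and customer pair. The system names, SKUs, and figures are hypothetical placeholders, not anything from a real deployment.

```python
from collections import defaultdict

def find_price_conflicts(records):
    """Return the (product, customer) pairs where systems disagree on price."""
    prices = defaultdict(dict)
    for system, product, customer, price in records:
        prices[(product, customer)][system] = price
    # Keep only the pairs that have more than one distinct price.
    return {
        key: by_system
        for key, by_system in prices.items()
        if len(set(by_system.values())) > 1
    }

# Hypothetical exports from three systems (assumed data, for illustration only).
records = [
    ("erp", "SKU-100", "ACME", 42.00),
    ("crm", "SKU-100", "ACME", 42.00),
    ("spreadsheet", "SKU-100", "ACME", 39.50),
    ("erp", "SKU-200", "ACME", 15.00),
    ("crm", "SKU-200", "ACME", 15.00),
]

conflicts = find_price_conflicts(records)
for (product, customer), by_system in conflicts.items():
    print(product, customer, by_system)
```

Any pair that shows up in `conflicts` is a place where an agent would have to guess which system to trust.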

2. Entity resolution — knowing when two records are the same thing. "ACME Industries", "Acme Inds.", and "ACME Industries Pty Ltd" are the same customer. A human knows this. Your CRM does not. Until something — a deterministic rule, a fuzzy match, an embedding-based dedupe pass — collapses those into a single canonical record, every agent that reaches into customer history will see a partial view and act on it.
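A deterministic rule plus a fuzzy match can be sketched in a few lines using Python's standard `difflib`. This is an illustration under stated assumptions: the abbreviation map, the legal-suffix list, and the 0.9 threshold are all hypothetical, and a real dedupe pass would tune them against your own records.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables; real ones would be built from your own data.
ABBREVIATIONS = {"inds": "industries", "ind": "industries"}
LEGAL_SUFFIXES = {"pty", "ltd", "inc", "llc", "limited"}

def normalise(name):
    """Lowercase, strip punctuation, expand abbreviations, drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

def same_entity(a, b, threshold=0.9):
    """Treat two names as one customer if their normalised forms are near-identical."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold

names = ["ACME Industries", "Acme Inds.", "ACME Industries Pty Ltd"]
print({normalise(n) for n in names})
```

All three spellings normalise to the same string, so they collapse into one canonical record, while an unrelated name stays separate.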

3. A document and knowledge layer the agent can actually search. Your product documentation, policies, contracts, and SOPs are probably scattered across SharePoint, Google Drive, email threads, and someone's desktop. A surfing business has consolidated those into a searchable corpus — typically a vector database — so that when an agent answers a customer question, it's grounding the answer in your own material rather than guessing from its training data.

The test: can you, today, ask a single search box "what's our return policy for international orders over $10,000?" and get the answer from the actual policy document? If not, neither can your agent.

Systems layer — what the agent acts through

4. Live, governed access to operational systems. Stock levels, pricing tiers, customer history, order status. These live in real systems — your ERP, your CRM, your warehouse management system — and they change minute to minute. An agent that reads a nightly export is an agent that quotes yesterday's stock. The foundation is the integration layer that lets agents query these systems live, through a defined interface, with the access controls a human user would have.

This is where MCP servers, API gateways, and role-based access control come in. Not because they're trendy, but because without them your agent either can't see the data at all, or can see everything — including the things it shouldn't.

The test: can the agent read current stock and write a real quote through a defined interface — or is it working off an export and handing the result to a human to re-key?
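To make "a defined interface with access controls" concrete, here is a rough sketch of a gateway that checks a role's permissions before any tool call goes through. The role names, permission strings, and the `read_stock` and `write_quote` functions are hypothetical stand-ins for live ERP and CRM calls, not a real product's API.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission map.
ROLE_PERMISSIONS = {
    "sales_rep": {"stock:read", "quote:write"},
    "viewer": {"stock:read"},
}

class PermissionDenied(Exception):
    pass

@dataclass
class ToolGateway:
    """Every agent action passes through here and is checked against the role."""
    role: str

    def call(self, permission, tool, *args):
        if permission not in ROLE_PERMISSIONS.get(self.role, set()):
            raise PermissionDenied(f"{self.role} may not {permission}")
        return tool(*args)

# Stand-ins for live system calls (assumed, for illustration only).
STOCK = {"SKU-100": 12}

def read_stock(sku):
    return STOCK[sku]

def write_quote(customer, sku, qty):
    return {"customer": customer, "sku": sku, "qty": qty, "status": "draft"}

agent = ToolGateway(role="sales_rep")
print(agent.call("stock:read", read_stock, "SKU-100"))
print(agent.call("quote:write", write_quote, "ACME", "SKU-100", 5))
```

The design choice worth noting is that the agent never calls `read_stock` or `write_quote` directly, so scoping its access means editing one permission map.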

Harness layer — what makes the agent safe to run

5. Observability — knowing what the agents are doing. Every agent action needs to be logged: what it was asked, what data it pulled, what it produced, what a human did with the output. Without this, you can't debug a bad quote, you can't improve the agent, and you can't prove to a regulator (or a customer's lawyer) what happened. This is the part teams skip first and regret most.
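What that log record might look like, as a minimal sketch: one structured entry per agent action, covering the four things listed above. The field names are my own assumption, and the plain list used as a sink stands in for whatever log store you actually run.

```python
import json
import time
import uuid

def log_agent_action(sink, request, sources, output, human_action=None):
    """Append one JSON record per agent action and return its id."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "request": request,            # what the agent was asked
        "sources": sources,            # what data it pulled
        "output": output,              # what it produced
        "human_action": human_action,  # what a person did with the output
    }
    sink.append(json.dumps(record))
    return record["id"]

# A list stands in for a real log store (assumption for illustration).
audit_log = []
log_agent_action(
    audit_log,
    request="Quote 10 x SKU-100 for ACME",
    sources=["erp:stock/SKU-100", "crm:customer/ACME"],
    output="Quote total 420.00",
    human_action="approved",
)
print(audit_log[0])
```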

6. Evaluation — a way to measure quality before and after every change. This is what lets you exit the testing phase instead of living in it. You need a set of real cases with known-good answers, run automatically, so that when you change a prompt, a model, or a data source, you can see whether quality went up or down. Without it, every tweak is a gamble and "is it working?" is a matter of opinion.

The test: if you swapped the underlying model tomorrow, would you know — with numbers, not vibes — whether your agent got better or worse?
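"Numbers, not vibes" can start very small. The sketch below scores two versions of an agent against the same set of known-good cases. The cases and both toy agents are hypothetical, and exact string matching is an assumed scoring rule; real evaluations often need something more forgiving.

```python
def evaluate(agent, golden_cases):
    """Return the fraction of golden cases the agent answers exactly right."""
    passed = sum(1 for question, expected in golden_cases if agent(question) == expected)
    return passed / len(golden_cases)

# Hypothetical golden set: real cases with known-good answers.
golden_cases = [
    ("price SKU-100", "42.00"),
    ("price SKU-200", "15.00"),
    ("stock SKU-100", "12"),
]

# Two toy agents standing in for "before" and "after" a change.
def agent_v1(question):
    return {"price SKU-100": "42.00", "price SKU-200": "15.00"}.get(question, "unknown")

def agent_v2(question):
    answers = {"price SKU-100": "42.00", "price SKU-200": "15.00", "stock SKU-100": "12"}
    return answers.get(question, "unknown")

before = evaluate(agent_v1, golden_cases)
after = evaluate(agent_v2, golden_cases)
print(f"before={before:.2f} after={after:.2f}")
```

Run on every change to a prompt, model, or data source, the same comparison tells you whether quality went up or down.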

7. Orchestration and guardrails — the rails the agent runs on. Multi-step tasks need managing: retries, hand-offs, and a defined stopping point. And every consequential action needs a limit, a checkpoint, or a human-in-the-loop escalation, so a wrong answer gets caught before it reaches a customer. This is the difference between an agent you can let near the business and a clever prototype nobody trusts.
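Retries, a defined stopping point, and a human checkpoint fit in one small function. This is a sketch, assuming a quote-producing step that can fail transiently; the 10,000 approval limit, the three-attempt cap, and the `make_flaky_step` helper are hypothetical choices for illustration.

```python
def run_with_guardrails(step, max_attempts=3, approval_limit=10_000):
    """Retry a flaky step, stop after max_attempts, and hold large quotes for a human."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            quote = step()
        except RuntimeError as err:
            last_error = err
            continue  # transient failure: try again
        if quote["total"] > approval_limit:
            # Consequential action: checkpoint instead of sending.
            return {"status": "needs_human_approval", "quote": quote, "attempts": attempt}
        return {"status": "sent", "quote": quote, "attempts": attempt}
    # Defined stopping point: give up and hand off to a person.
    return {"status": "escalated", "error": str(last_error), "attempts": max_attempts}

def make_flaky_step(failures, total):
    """Hypothetical step that fails `failures` times, then returns a quote."""
    state = {"calls": 0}

    def step():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("ERP timeout")
        return {"total": total}

    return step

print(run_with_guardrails(make_flaky_step(failures=1, total=500)))
print(run_with_guardrails(make_flaky_step(failures=0, total=25_000)))
print(run_with_guardrails(make_flaky_step(failures=5, total=500)))
```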

The full stack, layer by layer

When your team or a vendor shows you an architecture diagram, it will have far more than three boxes on it. That's not a contradiction — the three layers aren't a different model, they're the same engineering stack grouped the way a business owner needs to reason about it. Here's the fuller picture the practitioners are working from:

The AI engineering stack: data foundations and retrieval at the base; models, agent architecture, integration and human interface in the middle; evaluation, observability, safety and governance as cross-cutting concerns; strategy on top and AI DevOps underneath

Don't try to memorise it. Every box lives inside one of the three layers: Data at the base, Systems in the middle, and the Harness wrapping everything else — with the agent itself just one box near the top, added last. Here's each layer in plain English, from the ground up.

1. Data foundations

This is where most projects should start, and where most businesses underinvest. Before anything clever happens, you need real answers to five questions: what data exists, where it lives, who owns it, how good it is, and how often it updates. The unglamorous plumbing — a catalogue of what you have, pipelines that land it somewhere you can actually query, and proper handling of personal data — is the load-bearing part. Vector stores and knowledge graphs sit here too, but they come after the basics, not instead of them.

Data foundations layer — data catalog, pipelines, quality, and lineage

2. Retrieval & knowledge

Once the data is reachable, the question becomes how the agent finds the right piece at the right moment. That's retrieval: how documents get broken up, searched, and ranked before the model ever sees them. Most people reach for "train a custom model" when better retrieval would do the job faster and for less money. If your agent is grounding its answers in your actual material rather than guessing, this layer is the reason.
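To show the shape of "broken up, searched, and ranked", here is a deliberately simple sketch. It uses word-count cosine similarity as a stand-in for the embedding search a vector database would do, and the policy snippets are invented examples rather than real documents.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    """Break a document into pieces of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def tokens(text):
    """Count lowercase alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, chunks, top_k=1):
    """Rank chunks by similarity to the query and return the best matches."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: cosine(q, tokens(c)), reverse=True)[:top_k]

# Invented policy snippets (assumptions for illustration).
corpus = [
    "Returns on international orders over $10,000 require manager approval.",
    "Standard shipping takes five business days.",
    "Invoices are due within thirty days.",
]
print(search("return policy for international orders", corpus))
```

Only the top-ranked pieces are passed to the model, which is why the quality of chunking and ranking matters so much for answer quality.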

3. Models & orchestration

Which model does the work, and how you manage it. You rarely want one model for everything — a cheap one can triage the easy cases and an expensive one can handle the hard ones, which keeps cost under control. The discipline that matters here is treating prompts like code: kept in version control, reviewed, and tested, not quietly edited in a text box and forgotten.

Models and orchestration layer — model selection, routing, prompt management, and caching
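A cheap-versus-expensive router can begin as a plain rule before it becomes anything smarter. The sketch below is one possible triage rule; the keyword list and the 30-word cut-off are hypothetical and would need tuning against your own enquiries.

```python
import re

# Hypothetical signals that an enquiry needs the stronger model.
HARD_KEYWORDS = {"contract", "dispute", "legal", "custom"}

def choose_model(enquiry, max_easy_words=30):
    """Route long or sensitive enquiries to the expensive model, the rest to the cheap one."""
    words = re.findall(r"[a-z0-9-]+", enquiry.lower())
    if len(words) > max_easy_words or HARD_KEYWORDS.intersection(words):
        return "expensive"
    return "cheap"

print(choose_model("What is the price of SKU-100?"))
print(choose_model("We want to dispute the contract terms."))
```

Because the rule lives in code, it can sit in version control and be reviewed and tested like any other change.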

4. Agent architecture

This is the agent itself — how it uses tools, what it remembers, how it plans a multi-step task, and where it stops to check with a human. Getting an agent to work once in a demo is easy; the hard part is making it reliable across the messy long tail of real-world inputs. The rule of thumb that separates working systems from science projects: start simple, and only add complexity when your testing proves it's needed.

Agent architecture layer — tools and function calling, memory, planning, and state

5. Integration & actions

How the agent actually does things in your systems — produces the quote in the CRM, updates the order in the ERP, raises the ticket. The critical and most-overlooked piece is permissions: an agent acting on someone's behalf should inherit that person's access, not run with the keys to everything. This is the layer where security review tends to slow a rollout down — which is a feature, not a bug.

Integration and actions layer — API gateways, MCP servers, connectors, authentication and identity

6. Evaluation & observability

The most undersold part of the whole stack, and the one that decides whether you ever leave "testing". Evaluation is a repeatable way to measure quality; observability is being able to see what the agent did and why. Without them you can't safely improve anything, ship with confidence, or prove the thing is working. Insist on this early — it becomes the spine everything else hangs off.

Evaluation and observability layer — golden datasets, regression suites, production tracing, cost and latency dashboards

7. Safety, governance & compliance

The guardrails: defences against manipulated inputs, filtering of bad outputs, an audit log of every action the agent takes, redaction of personal data, and alignment with whatever rules your industry runs under — GDPR, SOC 2, HIPAA. In regulated sectors this isn't a final coat of paint; it often dictates how the whole system has to be built in the first place.

Safety, governance and compliance layer — prompt-injection defence, output filtering, audit logs, PII redaction, and regulatory alignment
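As one small example of redaction, the sketch below masks email addresses and phone numbers before text is logged or sent to a model. The two patterns are illustrative assumptions and far from complete; production redaction usually relies on a dedicated tool and covers many more data types.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

message = "Contact jane.doe@example.com or +61 400 123 456 about order 42."
print(redact(message))
```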

8. AI DevOps & lifecycle

The same engineering discipline ordinary software has had for years, applied to AI: separate development, staging, and production environments, gradual rollouts, feature flags for AI features, and a way to roll back when something regresses. Most organisations are well behind here — which is exactly why an agent that worked yesterday can quietly start misbehaving today with no one noticing.

AI DevOps and lifecycle layer — CI/CD for prompts and agents, environment separation, canary rollouts, feature flags, and rollback
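A gradual rollout behind a feature flag can be as simple as hashing each user into a stable bucket. This is a minimal sketch of that idea; the flag name and user ids are hypothetical.

```python
import hashlib

def in_canary(user_id, flag, percent):
    """Deterministically place a user in or out of a rollout of `percent` percent."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket from 0 to 99
    return bucket < percent

# Hypothetical flag: expose a new prompt version to 10% of users.
users = [f"user-{i}" for i in range(1000)]
exposed = [u for u in users if in_canary(u, "quote-prompt-v2", 10)]
print(len(exposed), "of", len(users), "users see the new prompt")
```

Rolling back is then a matter of setting the percentage to zero, and widening the rollout never removes a user who was already included.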

9. Human interface

How the output actually reaches a person — answers that stream as they're generated, citations back to the source, honest confidence signals, graceful handling of failure, and an easy way to give feedback or corrections. It gets dismissed as polish, but it's where trust is won or lost. A right answer presented badly still won't get used.

Human interface layer — streaming responses, citations, confidence signals, graceful failure, and feedback capture

10. Strategy & use-case selection

The wrapper around everything else: choosing which use case to tackle first (high value and actually feasible), deciding what to build versus buy, and managing the organisational change that determines whether anyone uses what you ship. This is why a proper discovery phase comes first — most businesses ask for "an agent" when what they actually need is three-quarters data plumbing before an agent makes any sense.

Strategy layer — use-case prioritisation, build-versus-buy, vendor lock-in, and change management

Why this gets skipped — and why skipping it is expensive

The reason foundations get skipped is structural, not a matter of laziness. Foundation work has three properties that make it hard to fund:

It produces no visible output for weeks or months.
It doesn't have a demo.
The beneficiary is the next project, not the one paying for it.

Meanwhile, a chatbot proof-of-concept can be stood up by Friday and shown to the board on Monday. The incentives push every organisation toward the demo and away from the plumbing.

The cost of that trade-off doesn't show up immediately. It shows up six months in, when the third agent goes live and the team realises every new agent is paying the same integration tax, hitting the same data quality issues, and producing the same percentage of subtly wrong outputs. The business has now spent more on agents than it would have spent on the foundation — and still doesn't have the foundation.

This is the ceiling I described in the pillar piece. It's not a ceiling on what AI can do. It's a ceiling on what your AI can do, given what's underneath it.

What good sequencing looks like

The sequencing that actually works, in my experience, is roughly:

1. Pick one workflow: the highest-volume one, where the economics are obvious. Quote-to-cash, support triage, onboarding, whatever it is.
2. Map the data and systems that workflow depends on. Not the whole business, just this workflow. Which entities, which systems, which documents.
3. Fix the foundation for that workflow only, across all three layers. Data: master data for the entities involved, and a searchable corpus of the documents. Systems: live, governed access to the operational tools the workflow touches. Harness: logging, an evaluation set to measure quality, and guardrails for anything consequential.
4. Then build the agent. It will work. It will keep working. And, critically, the foundation you built will be reusable for the next workflow.

This is the opposite of how most businesses are doing it, which is "buy ten tools, hope they compose." They don't compose. They sit alongside each other, each with its own partial view of the business, each producing its own confident-but-wrong outputs.

The uncomfortable question

If you're a business owner and an internal team or vendor is pitching you an AI project, the question to ask isn't "what model are you using?" or "how good is the agent?" Those are the easy questions.

The hard question is: "What does this agent need to be true about our data, our systems, and the harness around it to work — and which of those things are actually true today?"

If the answer is a shrug, or a confident "we'll handle that as we go," you're being sold the demo. The foundation work will land on you later, at a worse time, after the agent has already produced enough wrong outputs to damage trust internally.

If the answer is a specific list of entities, systems, and documents, and a plan to address each one before the agent goes live — that's the conversation that produces a surfing business.

So where do you start?

If you're not sure whether your foundations are ready for AI, the diagnostic is shorter than you'd think. Pick one workflow. Walk through it. At every step, ask: which system is the source of truth, and is it right? The gaps will be obvious within an hour.

That's the conversation I have with clients on a free 30-minute strategy call — picking the workflow, mapping what's underneath it, and being honest about what needs to be true before an agent should touch it. Book a time here.
