Harness Engineering: The Part of Agentic AI Nobody Writes About

Solution Architect at IBM with 12+ years in enterprise software, cloud, and applied AI. I write about what production agentic AI actually looks like, the parts that don't fit the marketing. Senior Member, IEEE.

Everyone is tuning prompts and picking models. Almost nobody is talking about the harness. And after building agentic systems for enterprise customers over the last year, the harness is where most of my production problems actually live.

This isn't about prompt engineering. There's enough of that already. This is about the scaffolding around the model: how you assemble context, manage tools, handle retries, and catch what comes back. The model is the engine. The harness is everything that feeds it and everything that cleans up after it.

The short version

Your model is rarely the problem. How you feed it and how you handle its output usually are. Teams spend weeks tuning prompts when the real fix was upstream in the harness the whole time.

Context is the thing people underestimate

Everyone talks about prompts. Context matters more.

What I see with junior engineers is they keep tweaking the prompt when they don't get the answer they want. They'll ask the same thing five different ways. Most of the time the prompt isn't the problem. The model just didn't have enough context to work with.

That's why requirements, specs, business rules, instruction files, and knowledge bases matter more than the prompt does. The model can generate code all day. It can't know how your business works unless the harness puts that in front of it.

Where the harness breaks

Context bloat. You stuff in too much. Full history, every tool definition, whole documents. You burn tokens and the output gets worse, not better. More context is not better context.
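One way to keep bloat in check is to give context a budget and make items compete for it. The sketch below is a minimal illustration, not a production assembler: `ContextItem`, `estimate_tokens`, and `assemble_context` are hypothetical names, the "lower priority number wins" convention is my assumption, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    priority: int  # assumption: lower number means more important


def estimate_tokens(text: str) -> int:
    # Crude stand-in (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def assemble_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the most important items that fit inside the token budget."""
    chosen: list[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost > budget:
            continue  # does not fit; a smaller, lower-priority item still might
        chosen.append(item)
        used += cost
    return chosen
```

The point of the design is that the full history has to earn its place against the business rules and the spec, instead of being included by default.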

Stale context. You feed the model a snapshot, the underlying data changes, and nobody re-feeds it. The harness has no sense of freshness, so the model is reasoning over a world that no longer exists.
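Giving the harness a sense of freshness can be as simple as attaching a maximum age to each piece of context and re-fetching when it expires. This is a rough sketch under that assumption: `FreshValue`, `fetch`, and `max_age_s` are hypothetical names, and a real system might prefer change notifications or version checks over a timer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FreshValue:
    fetch: Callable[[], str]  # hypothetical loader for the underlying data
    max_age_s: float          # how long a snapshot may be reused
    _value: Optional[str] = None
    _fetched_at: float = 0.0

    def get(self, now: Optional[float] = None) -> str:
        """Return the cached snapshot, re-fetching it once it is too old."""
        now = time.monotonic() if now is None else now
        expired = now - self._fetched_at > self.max_age_s
        if self._value is None or expired:
            self._value = self.fetch()
            self._fetched_at = now
        return self._value
```

The `now` parameter exists only so the expiry logic can be exercised without waiting on a real clock.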

Tool overload. Expose fifty tools and the model picks wrong. The tool definitions eat your context window before the real request even lands.
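A small per-call tool surface can be sketched as a filter that runs before the model ever sees the tool list. The example below uses naive keyword overlap purely for illustration; `Tool`, `tags`, and `select_tools` are hypothetical names, and a real harness would more likely use embeddings, routing rules, or workflow state to decide relevance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tags: frozenset[str]  # assumption: hand-written keywords per tool


def select_tools(request: str, tools: list[Tool], limit: int = 5) -> list[Tool]:
    """Return at most `limit` tools whose tags overlap the request's words."""
    words = set(request.lower().split())
    scored = [(len(words & tool.tags), tool) for tool in tools]
    relevant = [pair for pair in scored if pair[0] > 0]
    # Highest overlap first; the sort is stable, so ties keep registry order.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in relevant[:limit]]
```

Only the definitions returned here would be sent with the request, so the other tools cost nothing in the context window for that call.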

The mistake everyone makes

Teams treat the harness as plumbing. Boring code you write once and forget. So they pour all their effort into the prompt and the model choice, and leave the harness as an afterthought.

Then it breaks in production, and they go right back to tuning the prompt, because that's the part they've been taught to think about.

The prompt was never the problem.

What I'd build today

Treat context as a first-class design problem, not a string you concatenate. Decide what the model needs, when it needs it, and how it stays fresh. Keep the tool surface small and relevant per call. Log what you actually sent the model, not just what it sent back, so you can debug why it did what it did. And design for failure inside the loop: a single tool call failing is easy to handle, but a multi-step run failing halfway is where real systems fall apart.
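To make the last two points concrete, here is a minimal sketch of a run loop that records the input to every step before running it and checkpoints each success, so a run that fails halfway can be resumed without repeating finished steps. Everything here is an assumption for illustration: `run_steps`, the `done` key, and the in-memory `log` list are hypothetical, and a real harness would persist both the state and the log somewhere durable.

```python
import json
from typing import Callable

Step = Callable[[dict], dict]  # takes the run state, returns updates to it


def run_steps(steps: list[tuple[str, Step]], state: dict, log: list[str]) -> dict:
    """Run named steps in order, skipping any already recorded as done."""
    done = state.setdefault("done", [])
    for name, step in steps:
        if name in done:
            continue  # finished in an earlier attempt; do not repeat side effects
        # Record exactly what the step was given, before it runs.
        log.append(json.dumps({"step": name, "input": state}, sort_keys=True))
        state.update(step(state))  # an exception here leaves `done` untouched
        done.append(name)  # checkpoint only after the step succeeded
    return state
```

If a step raises, the caller still holds the state and can call `run_steps` again with it; only the failed step and the ones after it will run. The sketch assumes the state is JSON-serializable and that steps do not write to the `done` key themselves.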

None of this is exciting. It's also where the leverage is.

Closing thought

We spent the last two years getting good at prompting. The next two are about getting good at the harness. The teams that win won't be the ones with the cleverest prompts. They'll be the ones who did the boring work of capturing their context and feeding it well.

The code still matters. But so do the specs, the instruction files, and the workflows that tell the model what it's supposed to do. That's where the real work is moving.


Karthik Karunanithi is a Solution Architect at IBM. Writing about what production agentic AI actually looks like in enterprise environments, including the parts that don't fit the marketing.

K

The autonomy boundary is the part people underestimate most. Too timid and it's useless, too loose and it's taking actions nobody signed off on. And that's not a model setting, it's a judgment call someone has to deliberately encode.

The "approval gates that exist on paper but never halt execution" one hits home. Seen that exact thing, looks governed, isn't.

Thank you, I appreciate the thoughtful read. This is the conversation I wanted the post to start.

AskReal, 5d ago

This is the conversation that actually needs to happen. Everyone optimizes the 20% (model + prompt) and ignores the 80% (harness) — and then wonders why their agent works in demos but fails in production.

The harness is where all the real decisions live: how context is managed across turns, when to stop and ask for human input, how errors surface, what gets retried vs. escalated. These aren't model problems, they're systems design problems. And most teams aren't treating them that way.

What I've seen break repeatedly in enterprise deployments:

  • Context window mismanagement (stuffing too much, losing critical state)
  • No graceful degradation when a tool call fails mid-chain
  • Approval gates that exist on paper but never actually halt execution
  • Logging that tells you what happened but not why the agent made a particular branch decision

The harness is also where you encode domain judgment — what the agent is allowed to do autonomously vs. what requires a human decision. Get that boundary wrong and you either build an agent that's too timid to be useful or one that takes consequential actions no one intended to delegate.

Solid framing. This deserves more attention than another "which model wins the benchmark" post.

AI Systems in Production

Part 1 of 6

This series covers what actually happens when AI systems move from demo to production - agent workflows, LLM behavior, failure modes, and the architectural decisions that make systems reliable at scale.
