What Actually Breaks When You Put Azure AI Foundry Agents in Production?

Demos lie. Your agent works perfectly when you’re showing it to stakeholders, then falls apart the first time a tool times out at 2am. I’ve been deploying Foundry agents into enterprise environments for a while now, and the failures look nothing like what the documentation prepares you for. Here is what we ended up doing about it.

The real failure mode

People expect agents to fail like APIs fail. They don’t. API failures are clean. The call works or it doesn’t. Agent failures are messy because they’re partial. Step three of a five-step workflow times out. The model hits a rate limit halfway through reasoning. A tool returns a malformed response that the agent then tries to interpret. The whole thing degrades sideways instead of breaking cleanly. This is the actual problem to solve. Not “how do I retry,” but “how do I make partial failures predictable.”

Classify before you retry

The biggest mistake I see teams make is wrapping every tool call in a generic retry block. This burns tokens, adds latency, and often retries things that were never going to succeed. We split failures into two buckets. Transient stuff like timeouts, throttling, and momentary unavailability gets retried. Deterministic stuff like malformed inputs, permission errors, and bad tool signatures does not. A permission error is not going to fix itself on the third try. This single distinction cut our retry overhead by more than half.
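To make the two buckets concrete, here is a minimal sketch in Python. The exception classes prefixed with Tool and the classify_failure function are hypothetical names invented for illustration, not part of any Azure SDK. Treating unknown errors as deterministic is also an assumption; you may prefer the opposite default.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"          # worth retrying
    DETERMINISTIC = "deterministic"  # retrying will not help


# Hypothetical exception types standing in for whatever your tool layer raises.
class ToolThrottledError(Exception):
    """The downstream service asked us to slow down."""


class ToolUnavailableError(Exception):
    """The downstream service is momentarily unavailable."""


class ToolInputError(Exception):
    """The tool was called with malformed input."""


class ToolSignatureError(Exception):
    """The tool was called with the wrong signature."""


TRANSIENT_ERRORS = (TimeoutError, ConnectionError, ToolThrottledError, ToolUnavailableError)
DETERMINISTIC_ERRORS = (PermissionError, ToolInputError, ToolSignatureError)


def classify_failure(exc: BaseException) -> FailureKind:
    """Decide whether a failure is worth retrying."""
    if isinstance(exc, DETERMINISTIC_ERRORS):
        return FailureKind.DETERMINISTIC
    if isinstance(exc, TRANSIENT_ERRORS):
        return FailureKind.TRANSIENT
    # Assumption: anything we do not recognise fails fast rather than retrying.
    return FailureKind.DETERMINISTIC
```

The point of keeping this as a single function is that the retry logic never has to guess. It asks once, and only the transient bucket ever reaches the retry path.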

Bounded retries, not enthusiastic ones

Three retries max. 1s, 2s, 4s backoff. That’s it. Uncapped retries in an agent workflow are a trap. They make the system feel slower than just failing, and in a multi-step workflow the latency compounds in ways that make the whole experience unusable.
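A rough sketch of what that policy can look like in Python. The function name call_with_bounded_retries and its parameters are made up for this example. The is_transient predicate is assumed to come from whatever failure classification you already have, and sleep is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Three retries max, with the 1s, 2s, 4s backoff described above.
BACKOFF_SECONDS = (1.0, 2.0, 4.0)


def call_with_bounded_retries(
    fn: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    delays: tuple[float, ...] = BACKOFF_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn once, then retry at most len(delays) times on transient errors."""
    for delay in delays:
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc):
                raise  # deterministic failure: retrying will not help
            sleep(delay)
    # Final attempt after the last backoff; any exception reaches the caller.
    return fn()
```

With the default delays this makes at most four attempts in total and waits at most seven seconds. That worst case is worth writing down per tool, because in a five-step workflow it is paid per step.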

Circuit breakers belong at the tool layer

This is the architectural decision I’d make again every time. A lot of teams put resilience logic at the workflow level. We put it at the tool execution layer using Azure Functions for isolation. When a specific tool starts failing repeatedly, the breaker opens and subsequent calls fail fast instead of waiting for timeouts. The reason this matters: one bad downstream dependency shouldn’t freeze your entire agent. Isolating failures at the tool level means the agent can route around the broken thing instead of waiting on it.
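For readers who have not built one, here is a minimal in-process sketch of a per-tool breaker in Python. It is not our Azure Functions setup and it leaves out the isolation part entirely. The names ToolCircuitBreaker and CircuitOpenError, and the threshold and cool-down values, are assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class ToolCircuitBreaker:
    """One instance per tool, so one bad dependency cannot block the others."""

    def __init__(
        self,
        failure_threshold: int = 3,        # assumed value
        reset_after: float = 30.0,         # assumed cool-down in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast instead of waiting on a timeout.
                raise CircuitOpenError("tool circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The agent can treat CircuitOpenError as a signal to route around the tool straight away, which is the whole reason for putting the breaker at this layer.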

Define your fallbacks ahead of time

When the circuit is open and retries are exhausted, the agent needs somewhere to go. Hoping the model figures it out is not a strategy. We defined three fallback paths and used them in this order:

1. A reduced prompt with fewer tool dependencies. Often the request can still be satisfied with less information.
2. A cached response if the query is deterministic and we have a recent result.
3. Deferred processing for critical flows. The agent flags the request and we process it later. This one is underrated. Enterprise users will accept “I’ll get back to you on this” far more readily than a confidently wrong answer.
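The ordering of the three fallback paths can be sketched as a plain function in Python. Everything here is hypothetical scaffolding: FallbackResult, run_fallbacks, and the three callables are names invented for the example. The cache lookup is assumed to return None when the query is not deterministic or no recent result exists.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FallbackResult:
    path: str              # "reduced_prompt", "cache", or "deferred"
    answer: Optional[str]  # None when the request was deferred


def run_fallbacks(
    request: str,
    reduced_prompt: Callable[[str], Optional[str]],
    cache_lookup: Callable[[str], Optional[str]],
    defer: Callable[[str], None],
) -> FallbackResult:
    """Try each fallback in a fixed order and report which one was used."""
    # 1. Reduced prompt with fewer tool dependencies.
    try:
        answer = reduced_prompt(request)
    except Exception:
        answer = None  # treat a failed reduced run as "no answer"
    if answer is not None:
        return FallbackResult("reduced_prompt", answer)

    # 2. Cached response, when one is available for this request.
    cached = cache_lookup(request)
    if cached is not None:
        return FallbackResult("cache", cached)

    # 3. Deferred processing: flag the request and answer later.
    defer(request)
    return FallbackResult("deferred", None)
```

Returning the path alongside the answer is a small design choice that pays off later. It lets the caller tell the user that the result is cached or deferred, and it gives you a metric for how often each fallback actually fires.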

Skip the heavy frameworks

We used structured prompts and our own validation logic instead of LangChain or similar orchestration frameworks. This was deliberate. When you hand orchestration to a framework, you also hand it the decision-making about what happens during failures. In production that’s the last thing you want to give up. The framework’s defaults are tuned for general cases. Yours need to be tuned for your specific failure modes.
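As one illustration of what owning the validation step can mean, here is a small Python sketch that checks a tool's raw response before the agent is allowed to interpret it. This is not our production code. The function validate_tool_response, the MalformedToolResponse error, and the idea of a required-fields mapping are all assumptions made for the example.

```python
import json
from typing import Any


class MalformedToolResponse(ValueError):
    """The tool replied, but not in a shape the agent should interpret."""


def validate_tool_response(raw: str, required: dict[str, type]) -> dict[str, Any]:
    """Parse a JSON tool response and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedToolResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise MalformedToolResponse("expected a JSON object")
    for field, expected_type in required.items():
        if field not in data:
            raise MalformedToolResponse(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise MalformedToolResponse(
                f"field {field!r} should be {expected_type.__name__}"
            )
    return data
```

A malformed response then becomes an explicit, deterministic failure that you decide how to handle, instead of something the model tries to make sense of on its own.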

What I’d tell anyone going into production

Stop trying to build a system that doesn’t fail. Build one that fails predictably. Classify failures before retrying them. Put circuit breakers at the tool level. Define fallback paths explicitly. Treat deferred processing as a legitimate output, not a defeat.

The agents that survive in production aren’t the cleverest ones. They’re the ones that degrade gracefully when their dependencies don’t.

I’m a Solution Architect at IBM working on Agentic AI orchestration and MCP integrations. More posts on what actually works in production coming.