The agent that dazzles in a demo stumbles in production. Here is how we wire tool calls, state, and cost control in the field.
An agent is not one prompt
An AI agent is a loop of planning, tool calls and memory. A demo needs one good run; production needs hundreds of different paths to all end safely. Our first decision is always the same: keep the number of tools minimal. Few well-defined tools beat many fuzzy ones every time.
Bound the loop
Infinite loops and "tool thrashing" (calling tools over and over without making progress) are the two most expensive failures. We give every agent a hard step limit, a per-step budget and a clear give-up condition. When a step fails, the agent should not blindly retry; it should hand off to a human in a controlled way.
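A minimal sketch of such a bounded loop, not our production code. It assumes the per-step budget is counted in tokens, and every name in it (`StepResult`, `RunOutcome`, `run_bounded`, `step_fn`, the default limits) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    ok: bool              # did the step (plan + tool call) succeed?
    tokens_used: int      # tokens spent on this step
    done: bool = False    # has the agent reached its goal?


@dataclass
class RunOutcome:
    status: str           # "completed" or "handed_off"
    reason: str           # why the loop stopped
    steps_taken: int
    tokens_used: int


def run_bounded(step_fn: Callable[[int], StepResult],
                max_steps: int = 8,
                max_tokens_per_step: int = 2000) -> RunOutcome:
    """Run step_fn until it finishes, fails, or exhausts a limit.

    step_fn is a stand-in for one iteration of the agent (assumption):
    it receives the 1-based step number and returns a StepResult.
    """
    total = 0
    for i in range(1, max_steps + 1):
        result = step_fn(i)
        total += result.tokens_used

        # Per-step budget: one oversized step ends the run.
        if result.tokens_used > max_tokens_per_step:
            return RunOutcome("handed_off", "step_budget_exceeded", i, total)

        # No blind retry: a failed step goes straight to a human.
        if not result.ok:
            return RunOutcome("handed_off", "step_failed", i, total)

        if result.done:
            return RunOutcome("completed", "goal_reached", i, total)

    # Hard step limit: the give-up condition for loops that never converge.
    return RunOutcome("handed_off", "step_limit_reached", max_steps, total)
```

Every exit path returns a reason string, so whoever picks up the handoff can see which limit fired rather than just that the run stopped.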
Observability is mandatory
You cannot run a production agent without logging each step’s input, the tool it chose, the tool’s output and the total token cost. We collect every agent run under a single trace, so when a bug report arrives we read the record instead of guessing.
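One way to picture a per-run trace, as a sketch only. The `Trace` and `StepRecord` classes and their field names are hypothetical; a real setup would more likely send these records to a tracing backend than hold them in memory.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StepRecord:
    step: int            # 1-based position within the run
    step_input: str      # what the agent saw at this step
    tool: str            # the tool it chose
    tool_output: str     # what the tool returned
    tokens: int          # tokens spent on this step


@dataclass
class Trace:
    # One trace id per agent run, so all steps can be looked up together.
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: List[StepRecord] = field(default_factory=list)

    def log_step(self, step_input: str, tool: str,
                 tool_output: str, tokens: int) -> None:
        self.steps.append(
            StepRecord(len(self.steps) + 1, step_input, tool, tool_output, tokens)
        )

    @property
    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)

    def to_json(self) -> str:
        # Serialize the whole run as one document keyed by trace_id.
        return json.dumps({
            "trace_id": self.trace_id,
            "total_tokens": self.total_tokens,
            "steps": [asdict(s) for s in self.steps],
        })
```

Calling `log_step` once per iteration and storing `to_json()` at the end of the run gives a single record to pull up when a bug report names a run.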
The eval pipeline
When you change a model or a prompt, answer "did it get better?" with a number, not a feeling. Without a curated eval set drawn from real production traffic, every update is a gamble. Scaling an agent really means making it measurable.
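To make "answer with a number" concrete, here is a small sketch that scores a baseline and a candidate agent on the same eval set. It assumes exact-match grading, which is the simplest possible grader and rarely enough on its own. `EvalCase`, `pass_rate` and `compare` are hypothetical names, and each agent is modeled as a plain function from prompt to answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str      # input taken from the curated eval set
    expected: str    # the answer counted as correct


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases where the agent's answer matches exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for c in cases if agent(c.prompt).strip() == c.expected)
    return passed / len(cases)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            cases: List[EvalCase]) -> Dict[str, float]:
    """Score both versions on the same cases and report the difference."""
    base = pass_rate(baseline, cases)
    cand = pass_rate(candidate, cases)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

A positive `delta` on a fixed eval set is the number to look for before shipping a model or prompt change. With a small set, a difference of one or two cases can be noise, so the size of the set matters as much as the score.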