Based on Agentic Design Patterns by Antonio Gulli (Springer). All book royalties go to Save the Children.

Key Takeaway
You can't improve an AI agent you can't measure. Monitoring and evaluation combines observability (tracing every step, tool call, token, and millisecond), offline and online evals, and the metrics that matter — task success, latency, cost per task, and groundedness — to decide what the agent should optimize for and which failure to fix first.
Why This Matters for Enterprise AI
Most teams ship an agent the way they'd ship a feature: build it, demo it, push it, move on. Then it starts failing in ways nobody can explain. A support agent quietly hallucinates a refund policy. A research agent burns forty tool calls to answer a question that needed three. Cost doubles one Tuesday and no one notices until finance asks. The agent is a black box, and the team is flying blind.
The fix here is not a sharper prompt; it is measurement. You cannot improve what you do not observe, and you cannot decide what to fix without knowing what is breaking and how often. Monitoring and evaluation is the discipline that turns an agent from a thing you hope works into a system you can reason about. It is also the pattern that makes every other pattern in this series safe to run in production, because reflection and adaptation only helps if you can tell whether the adaptation made things better or worse.
What Is Agent Monitoring and Evaluation?
Agent monitoring and evaluation is the practice of instrumenting an agent so every run is observable, then scoring those runs against defined criteria to decide what to improve. Antonio Gulli, in Agentic Design Patterns, treats it as the operational backbone of any serious deployment: observability tells you what happened, evaluation tells you whether what happened was good, and prioritization tells you which gap to close first.
Three distinct jobs hide inside that one phrase, and conflating them is where teams go wrong.
- Observability is the raw record. Every step, tool call, token, and millisecond of latency, captured so you can replay any run after the fact.
- Evaluation is the judgment. A score or label applied to a run, measured against criteria you set in advance. Did it succeed, was it grounded, did it stay on budget.
- Prioritization is the decision. Given a backlog of failures, which one costs you the most, and which do you fix this sprint.

A model you can prompt but not measure is a demo. A model you can measure, score, and triage is a product.
Observability: Tracing the Agent Run
Before you can evaluate anything, you have to see it. For a single LLM call, a log line is enough. For an agent that plans, calls tools, retrieves context, and loops, you need a trace: a structured, hierarchical record of everything the agent did to produce an answer.
A good trace captures four things at every step:
- Steps and reasoning. The sequence of decisions the agent made, including which branch it took and why. When an agent goes off the rails, the trace shows the exact step where it lost the plot.
- Tool calls. Every function the agent invoked, the arguments it passed, and what came back. A surprising share of agent failures are not reasoning failures at all. They are a malformed tool argument, or an API that returned an error the agent ignored.
- Tokens. Input and output token counts per step, because tokens are the unit cost of an agent and the thing that silently inflates the bill.
- Latency. Wall-clock time per step and end to end, so you can find the slow tool or the redundant call that makes the agent feel sluggish.
Tools like LangSmith, Arize Phoenix, and the OpenTelemetry GenAI conventions exist to capture exactly this. Which tool you pick matters less than having one at all. Without a trace, every debugging session is a guess, and every "it works on my machine" is unfalsifiable.
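To make the four fields concrete, here is a minimal sketch of a trace recorder in plain Python. It is not how LangSmith, Phoenix, or OpenTelemetry represent traces; the `Span` and `Trace` classes, the `kind` labels, and the `lookup_policy` tool are all hypothetical names invented for illustration, and the spans are kept flat where a real tracer would nest them.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Span:
    """One step of an agent run: a reasoning step or a tool call."""
    name: str
    kind: str                      # "reasoning" or "tool" (hypothetical labels)
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0


@dataclass
class Trace:
    """The ordered record of every span in a single run (flat, for brevity)."""
    run_id: str
    spans: list = field(default_factory=list)

    def record(self, name: str, kind: str, fn: Callable, inputs: dict,
               input_tokens: int = 0, output_tokens: int = 0) -> Span:
        """Call fn(**inputs), timing it and capturing its result or its error."""
        span = Span(name=name, kind=kind, inputs=inputs,
                    input_tokens=input_tokens, output_tokens=output_tokens)
        start = time.perf_counter()
        try:
            span.output = fn(**inputs)
        except Exception as exc:
            # Keep the failure in the trace instead of losing it.
            span.error = repr(exc)
        span.latency_ms = (time.perf_counter() - start) * 1000
        self.spans.append(span)
        return span

    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.spans)

    def failed_spans(self) -> list:
        return [s for s in self.spans if s.error is not None]


def lookup_policy(topic: str) -> str:
    """Hypothetical tool, standing in for a real retrieval call."""
    return f"policy text for {topic}"


trace = Trace(run_id="run-001")
trace.record("lookup_policy", "tool", lookup_policy, {"topic": "refunds"},
             input_tokens=12, output_tokens=40)
# A malformed tool argument: the wrong keyword name raises a TypeError,
# which the trace captures rather than swallowing.
trace.record("lookup_policy", "tool", lookup_policy, {"subject": "refunds"})
```

After this run, `trace.failed_spans()` holds the one span with the bad argument, which is the kind of failure that looks like a reasoning problem until you read the trace.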
Enterprise reality: When a customer says your agent gave them wrong information, "we'll look into it" is not an answer your support lead can use. A trace is. You pull the exact run, see that the agent retrieved a stale document, flagged the wrong clause, and never called the verification tool, and you fix the retrieval step. Without the trace, you are reduced to re-prompting and praying.
Evaluation: Offline and Online
Observability tells you what happened. Evaluation tells you whether it was any good. It comes in two flavors, and a mature team runs both.
Offline Evals: The Regression Suite
An offline eval runs your agent against a curated test set (a fixed collection of inputs paired with known-good expectations) in a controlled environment, before anything ships. This is the agentic equivalent of a unit-test suite, and it serves the same purpose: catch regressions before users do.
Build the set from real failures. Every time the agent breaks in a way that matters, capture that input, write down what the right behavior would have been, and add it to the suite. Over time you accumulate a regression battery that any change has to pass before it merges. Tighten a prompt, swap a model, adjust a retrieval step, then run the suite and you know in minutes whether you broke something that used to work.
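A regression suite of this kind can be sketched in a few lines. The version below is an illustration under assumptions, not a prescribed harness: `EvalCase`, `run_suite`, the two example cases, and `stub_agent` are hypothetical, and the checks are simple string predicates where a real suite might use richer assertions or a judge model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    user_input: str
    check: Callable[[str], bool]   # True if the agent's output is acceptable


def run_suite(agent: Callable[[str], str], cases: list) -> dict:
    """Run every case through the agent and report which ones failed."""
    failed = []
    for case in cases:
        try:
            ok = case.check(agent(case.user_input))
        except Exception:
            ok = False             # a crash counts as a failure
        if not ok:
            failed.append(case.case_id)
    passed = len(cases) - len(failed)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"passed": passed, "failed": failed, "pass_rate": pass_rate}


# Hypothetical cases, each captured from a past production failure.
CASES = [
    EvalCase("refund-window", "How long do I have to request a refund?",
             lambda out: "30 days" in out),
    EvalCase("no-invented-policy", "Do you price match?",
             lambda out: "price match" not in out.lower()),
]


def stub_agent(user_input: str) -> str:
    """Stand-in for the real agent so the sketch runs on its own."""
    if "refund" in user_input:
        return "You can request a refund within 30 days of purchase."
    return "Yes, we price match any competitor."


report = run_suite(stub_agent, CASES)
```

Here the stub invents a price-match policy, so `report["failed"]` names that case. Wired into CI, a non-empty `failed` list is what blocks the merge.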
The honest limitation: an offline set is a snapshot. Real users are inventive, and the long tail of production traffic will always contain inputs your curated set never imagined. Offline evals catch known failure modes fast and cheaply. They cannot catch the unknown ones.
Online Evals: Watching Real Traffic
An online eval scores real production traffic as it happens. This is where you catch the failures your test set missed: the novel inputs, the slow drift in quality after a model update, the edge cases that only show up at volume.
You rarely score every production run; that gets expensive. Instead you sample, or trigger evaluation on signals that suggest trouble: a user who rephrases the same question three times, a thumbs-down, a run that blew past its tool-call budget. Online evals are how you measure the agent against reality instead of against your assumptions about reality. They are also how you feed resource-aware AI agents the cost-and-latency data they need to decide when a cheaper model or a shorter path will do.
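One way to express "sample, or trigger on trouble" is a small gate that runs before the scorer. This is a sketch under assumptions: the `RunSignals` fields, the default budget of 10 tool calls, and the 5 percent sample rate are illustrative values, not recommendations from the book.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSignals:
    thumbs_down: bool = False
    rephrase_count: int = 0    # times the user re-asked the same question
    tool_calls: int = 0


def should_evaluate(signals: RunSignals, tool_call_budget: int = 10,
                    sample_rate: float = 0.05, rng=random) -> Optional[str]:
    """Return the reason this run should be scored, or None to skip it."""
    if signals.thumbs_down:
        return "thumbs_down"
    if signals.rephrase_count >= 3:
        return "rephrased"
    if signals.tool_calls > tool_call_budget:
        return "over_tool_budget"
    # No trouble signal: fall back to a random sample of ordinary traffic.
    if rng.random() < sample_rate:
        return "random_sample"
    return None


reason = should_evaluate(RunSignals(tool_calls=40))
```

Returning the reason rather than a bare boolean lets you later compare scores for triggered runs against the random sample, which is the closest thing to an unbiased view of overall quality.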
The two are complementary, not competing. Offline evals give you a fast, deterministic gate before deploy. Online evals give you ground truth after. Skip the offline set and every deploy is a gamble; skip the online layer and you are blind to everything your test set didn't predict.
The Metrics That Earn Their Place
You can measure a hundred things about an agent. Four of them belong on the dashboard.
- Task success rate. The percentage of runs that accomplished the user's goal. This is the north-star metric, and it is the one teams most often skip because it is the hardest to define. Defining "success" for your specific agent is most of the work, and it is work worth doing, because every other metric is secondary to whether the thing did its job.
- Latency. How long a run takes, end to end. An agent that gives a perfect answer in ninety seconds has lost the user who needed it in five. Track the median and the tail; the slowest 5 percent of runs are often where the worst experiences live.
- Cost per task. The total token and tool spend to complete one task. This is the number that decides whether your agent is a viable product or an expensive science project, and it is invisible unless you measure it per task rather than in aggregate.
- Hallucination and groundedness. How often the agent asserts something unsupported by its sources or tools. For any agent that retrieves information or touches a system of record, this is the difference between a useful product and a liability. Groundedness, meaning whether every claim is traceable to a retrieved source, is the measurable version of "did the agent make this up."
Notice the tension built into this list. The cheapest, fastest agent is rarely the most accurate, and the most accurate is rarely the cheapest. Monitoring exists so you can see the trade-off you are actually making instead of the one you assume you are making.
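Three of the four metrics fall straight out of the run records. The sketch below assumes a hypothetical `RunRecord` shape and made-up per-token prices; groundedness is left out because it comes from a scorer rather than from arithmetic over the trace. The tail latency uses the nearest-rank percentile, one of several common definitions.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class RunRecord:
    succeeded: bool
    latency_s: float
    input_tokens: int
    output_tokens: int
    tool_cost_usd: float = 0.0


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs: list, usd_per_input_token: float,
              usd_per_output_token: float) -> dict:
    """Compute dashboard metrics per task, not in aggregate."""
    costs = [
        r.input_tokens * usd_per_input_token
        + r.output_tokens * usd_per_output_token
        + r.tool_cost_usd
        for r in runs
    ]
    latencies = [r.latency_s for r in runs]
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / len(runs),
        "latency_p50_s": statistics.median(latencies),
        "latency_p95_s": percentile(latencies, 95),
        "cost_per_task_usd": sum(costs) / len(runs),
    }


# Hypothetical runs and prices, for illustration only.
runs = [
    RunRecord(True, 2.0, 1000, 200),
    RunRecord(True, 4.0, 2000, 400),
    RunRecord(False, 90.0, 8000, 1000, tool_cost_usd=0.01),
    RunRecord(True, 3.0, 1000, 400),
]
summary = summarize(runs, usd_per_input_token=1e-6, usd_per_output_token=2e-6)
```

In this toy data the median latency is 3.5 seconds while the tail is 90, and the single failed run accounts for most of the spend. That is exactly the kind of trade-off an aggregate bill hides.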
LLM-as-Judge: Scoring the Open-Ended
Task success rate is easy to compute when the answer is a number or a category. It is hard when the answer is a paragraph. Was this summary faithful? Was this support reply helpful and on-brand? You cannot regex your way to that judgment.
The increasingly standard approach is LLM-as-judge: use a second, often stronger, language model to score the first model's output against a rubric. You hand the judge the input, the agent's output, and a clear set of criteria, and it returns a score and a reason. It scales to the open-ended outputs that defeat exact-match checks, and it is the engine behind most online groundedness and quality evals today.
```python
# Abbreviated: an LLM-as-judge groundedness scorer, not production code
import json

from anthropic import Anthropic

client = Anthropic()

JUDGE_RUBRIC = """You are evaluating an AI agent's answer for groundedness.
Score 1-5: is EVERY claim in the answer supported by the provided sources?
5 = fully grounded, 1 = unsupported or fabricated.
Return strict JSON: {"score": <int>, "reason": "<one sentence>"}"""


def score_groundedness(question: str, answer: str, sources: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        system=JUDGE_RUBRIC,
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\n"
                f"Sources:\n{sources}\n\n"
                f"Agent answer:\n{answer}"
            ),
        }],
    )
    return json.loads(msg.content[0].text)


# Run across a sampled batch of production traces, then track the
# average score over time to catch quality drift after a model change.
```
The framing is appealing, but the caveats are real and you ignore them at your peril. An LLM judge inherits the biases of the model behind it. It tends to favor longer answers, it can prefer outputs that match its own style, and it can be lenient on exactly the subtle errors you most need it to catch. The discipline is to treat the judge as an instrument that itself needs calibration: validate it against a set of human-labeled examples, measure how often it agrees with your experts, and only trust its scores once that agreement is high enough to bank on. A judge you have not validated is a confident number with no ground truth underneath it. LLM-as-judge scales human judgment; it does not replace the need to establish it first.
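Calibration can start as simply as comparing the judge's verdicts to human labels on the same examples. The sketch below assumes the judge returns a 1-5 score and humans give a grounded / not-grounded label; the pass threshold of 4 and the sample data are hypothetical. It reports raw agreement, Cohen's kappa (agreement corrected for chance), and the count of lenient misses where the judge passed something a human rejected.

```python
def agreement_report(judge_scores: list, human_labels: list,
                     pass_threshold: int = 4) -> dict:
    """Compare judge verdicts with human grounded/not-grounded labels."""
    if not judge_scores or len(judge_scores) != len(human_labels):
        raise ValueError("need equal-length, non-empty score and label lists")

    judge_labels = [score >= pass_threshold for score in judge_scores]
    pairs = list(zip(judge_labels, human_labels))
    n = len(pairs)

    # Observed agreement: share of examples where judge and human match.
    observed = sum(j == h for j, h in pairs) / n

    # Agreement expected by chance, from each rater's pass rate.
    judge_pass = sum(judge_labels) / n
    human_pass = sum(human_labels) / n
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)

    # Cohen's kappa; defined as 1.0 here when chance agreement is total.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    # Lenient misses: the judge passed an answer the human rejected.
    false_passes = sum(j and not h for j, h in pairs)
    return {"agreement": observed, "kappa": kappa, "false_passes": false_passes}


# Hypothetical calibration set: judge scores and expert labels for 8 answers.
judge_scores = [5, 4, 2, 5, 1, 4, 3, 5]
human_labels = [True, True, False, False, False, True, False, True]
calibration = agreement_report(judge_scores, human_labels)
```

What counts as "high enough" agreement is a decision for your team and your stakes. The false-pass count deserves separate attention, because a lenient judge fails in exactly the direction that hurts most.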
Goal Setting and Prioritization: Deciding What to Fix First
Measurement is not the goal. Better decisions are. Once you can see and score your agent, two questions decide whether all that instrumentation pays off.
What should this agent optimize for? An agent cannot maximize accuracy and speed while also minimizing cost, because those goals pull against each other. A medical-summarization agent should optimize for groundedness even if it costs more and runs slower; nothing matters more than not making things up. A high-volume classification agent should optimize for cost per task, because a small per-run saving compounds across millions of runs. Naming the primary metric is a goal-setting decision, and it belongs to the business, not the model. Get it wrong and you will tune relentlessly toward the wrong target.
Which failure do you fix first? Your eval results will hand you a backlog of failure modes. Resist the urge to fix the one that annoys you most. Prioritize by impact: how often does this failure happen, and how much does each occurrence cost you in dollars, in user trust, in risk? A hallucination that hits 2 percent of runs but exposes you to compliance liability outranks a formatting quirk that hits 30 percent but bothers no one. This is also where monitoring earns its keep alongside safety: the failures your evals surface are precisely the ones your AI guardrails and safety layer should be catching at runtime, and the two systems sharpen each other.
Enterprise reality: A team that prioritizes by failure frequency alone will spend a quarter polishing the most common annoyance while a rare, expensive failure quietly drains the budget or invites a lawsuit. Prioritize by frequency times cost, not by frequency alone. The whole reason you measured cost per task and groundedness separately is so this decision is grounded in numbers instead of whoever argues loudest in the standup.
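The frequency-times-cost rule is simple enough to write down. In the sketch below, the 2 percent and 30 percent frequencies echo the example above, but the dollar figures and the third failure mode are invented for illustration; in practice the per-occurrence cost is an estimate that folds in trust and risk as well as direct spend.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    frequency: float             # share of runs affected, between 0 and 1
    cost_per_occurrence: float   # estimated dollars per occurrence (assumed)

    @property
    def expected_cost_per_run(self) -> float:
        return self.frequency * self.cost_per_occurrence


def prioritize(modes: list) -> list:
    """Rank failure modes by frequency times cost, highest impact first."""
    return sorted(modes, key=lambda m: m.expected_cost_per_run, reverse=True)


backlog = [
    FailureMode("formatting quirk", frequency=0.30, cost_per_occurrence=0.05),
    FailureMode("hallucinated policy", frequency=0.02, cost_per_occurrence=500.0),
    FailureMode("slow tool retry", frequency=0.10, cost_per_occurrence=1.0),
]
ranked = prioritize(backlog)
for mode in ranked:
    print(f"{mode.name}: {mode.expected_cost_per_run:.3f} expected dollars per run")
```

Sorted by frequency alone, the formatting quirk would top the list. Sorted by expected cost, it drops to last and the rare hallucination leads.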
When to Invest in Evaluation (and How Much)
Evaluation is not free, so match the rigor to the stakes.
- A throwaway internal prototype needs little more than eyeballing the traces. Building a regression suite for something three people use once is over-engineering.
- A customer-facing agent that touches money, health, or legal exposure needs the full stack: tracing, an offline regression suite, online monitoring, and validated LLM-judge scoring. Here, under-investing is the expensive choice.
- Most agents live in between. Start with tracing, which is cheap and pays for itself the first time you debug a real failure. Add an offline suite as soon as you have a handful of failures worth not repeating, then add online evals once you have real traffic to watch.
The honest test: if you cannot answer "how often does this agent succeed, and what does a run cost," you are not ready to scale it. Measure first, scale second.
Key Takeaways
- You cannot improve an agent you cannot measure. Monitoring and evaluation is the operational backbone that makes every other agentic pattern safe to run in production.
- Observability is the trace: steps, tool calls, tokens, and latency, captured so you can replay and debug any run instead of guessing.
- Run both kinds of eval: offline evals (curated test sets and regression suites) catch known failures cheaply before deploy; online evals score real production traffic to catch the failures your test set never imagined.
- Watch four metrics: task success rate (the north star), latency, cost per task, and hallucination/groundedness. They trade off against each other, so measure the trade-off instead of assuming it.
- LLM-as-judge scales scoring of open-ended outputs, but it inherits model bias and must be validated against human labels before you trust it.
- Prioritize fixes by frequency times cost, not frequency alone, and set one primary metric per agent so you optimize toward the goal the business actually cares about.
Previous in series
Exception Handling and Human-in-the-Loop: Making AI Agents Resilient
Next in series
Resource-Aware AI Agents: Optimization and Exploration Strategies