Modern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spendModern websites, ranked in AI searchCited by ChatGPT, Perplexity & Google AI OverviewsLower than your current SEO spend
Agentic AI

Reflection and Adaptation: How AI Agents Learn From Their Own Output

Reflection is the pattern where an AI agent critiques its own output and revises it, looping until the work clears a quality bar. It is the self-correction loop behind reliable agents.

Space & Story Team·June 15, 2026·10 min read
reflection patternself-correction AIevaluator-optimizeragentic design patternsAI agent qualityAntonio Gulli

Based on Agentic Design Patterns by Antonio Gulli (Springer). All book royalties go to Save the Children.

Space & Story Team·June 15, 2026·10 min read
Reflection and Adaptation: How AI Agents Learn From Their Own Output

Key Takeaway

Reflection is the self-correction pattern, where an AI agent critiques its own output and revises in a generate, critique, revise loop until the work clears a quality bar. It trades extra cost and latency for reliability, and it pays off most on code, math, and complex writing, as long as you set stopping criteria so the loop never runs forever.

Why This Matters for Enterprise AI

A junior analyst hands you a draft. You read it, mark the three weak spots, and hand it back. They fix those spots and the second draft is the one that ships. That review-and-revise step is so normal in human work that nobody calls it a pattern. In agentic AI it has a name, it carries real weight, and most teams skip it anyway.

The default move is to take whatever the model returns on the first pass and ship it. First drafts are where models are weakest. They hallucinate a citation, miss an edge case, leave a bug in the third function. Reflection is the fix: instead of trusting the first output, the agent evaluates its own work against a standard and revises until the work is good enough. If prompt chaining gave the agent a pipeline of focused steps, reflection gives it the one step humans rely on most: a second look before the work goes out the door.

What Is the Reflection Pattern?

Reflection is a design pattern where an agent generates an output, evaluates that output against a set of criteria, and uses the critique to produce a revised version. The loop repeats until the work meets a quality threshold or hits a budget. Antonio Gulli, in Agentic Design Patterns, frames it as the move from single-pass generation to iterative self-improvement. The agent stops being a one-shot text generator and starts behaving like a craftsperson who checks their own work.

You will see it called several names: the evaluator-optimizer loop, the generator-critic pattern, or just self-correction. The mechanics are the same. One role produces a candidate answer. A second role judges it and says what is wrong. The first role tries again with that feedback in hand. Anthropic's Building Effective Agents describes the same shape as the evaluator-optimizer workflow. It lists the pattern among the handful worth reaching for when a single call is not reliable enough.

A circular loop of three nodes, generate, critique, and revise, with a candidate output passing between them and a single accent marking the point where the work clears its quality bar and exits the loop
Reflection runs a generate, critique, revise loop. The output cycles back through review until it clears a quality bar, then exits, instead of shipping the unreviewed first draft.

The distinction from plain generation matters. A single call commits to its first answer with no chance to catch its own mistakes. A reflection loop builds in the catch. The model gets to be wrong on the first try, which is the try it is worst at, and right by the third.

How the Reflection Loop Works

A reflection loop runs as a cycle with a clear exit. The shape is three moves and a gate.

  1. Generate. The agent produces a first-pass output: a function, a proof, a paragraph, a research summary. This is the candidate, not the final answer.
  2. Critique. The output is evaluated against explicit criteria. Does the code pass the tests? Does the argument hold? Are the claims supported? The critique returns specific, actionable feedback, not a thumbs-up or thumbs-down.
  3. Revise. The agent rewrites the candidate using the critique, fixing the named problems while keeping what already worked. This is a targeted edit, not a fresh start.
  4. Check the gate. If the revision clears the quality bar, the loop exits and returns the result. If not, and there is budget left, the revised output goes back to step two.

The gate is the part teams forget, and it is the part that keeps the loop from running forever. Without a stopping rule, an agent can polish the same paragraph twenty times, burning tokens and latency for changes nobody asked for. We will come back to stopping criteria, because they are where this pattern lives or dies.

Self-Reflection vs. an External Evaluator

There are two ways to run the critique step, and the choice shapes everything downstream.

Self-reflection uses the same model to judge its own output. You prompt it to step back, find the flaws in what it just wrote, and list them. This is cheap and surprisingly effective for tasks where the model knows the standard but did not apply it the first time. The catch is the obvious one: a model that missed a bug while writing the code may miss the same bug while reviewing it. Its blind spots are correlated across both passes.

An external evaluator uses a separate critic to judge the output: a different prompt, a different model, or a hard tool. A separate model brings a fresh perspective and uncorrelated blind spots. Better still, the evaluator can be something deterministic: a test suite, a linter, a schema validator, a compiler. When the critic is a tool rather than another model, the feedback is ground truth, not another opinion. A unit test that fails is not guessing.

The strongest reflection setups lean on real signals wherever they exist. For code generation, the evaluator runs the tests. For a data extraction task, it validates the JSON against a schema. For research, it checks whether each claim traces to a source. Self-reflection fills the gaps where no hard check exists, like tone, clarity, or argument quality.

Code Example (Abbreviated)

Here is a minimal reflection loop in a LangGraph style. The generator writes code, an evaluator runs the tests, and the loop revises until the tests pass or the budget runs out.

# Abbreviated: illustrative generate-critique-revise loop, not production code
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o") MAX_ITERS = 3

def generate(task, feedback=""): prompt = f"Write a Python function for:\n{task}\n{feedback}" return llm.invoke(prompt).content

def evaluate(code): # External evaluator: run tests, return (passed, feedback) return run_tests(code) # ground truth, not another opinion

code = generate(task) for i in range(MAX_ITERS): passed, feedback = evaluate(code) if passed: break # quality gate met: stop, don't keep polishing code = generate(task, feedback=f"Fix these failures:\n{feedback}")

The same shape holds in Google ADK and CrewAI, where a critic agent reviews a worker agent's output and the orchestrator routes the work back for another pass. The framework changes; the generate, critique, revise loop does not.

Where Reflection Pays Off (and Where It Does Not)

Reflection is not free, so it earns its place only on tasks where a second pass measurably improves the work. Four categories are where it consistently pulls its weight.

Code generation. This is the clearest win, because the evaluator can be a test suite. The agent writes a function, the tests run, the failures come back as concrete feedback, and the next draft fixes them. The loop turns a model that writes plausible-looking code into one that writes code that actually runs.

Math and structured reasoning. A model can check a derivation step by step, catch an arithmetic slip or a logical gap, and correct it. The verification pass catches errors the generation pass glides over, because checking and producing are different cognitive jobs.

Complex writing. Long-form drafts and legal summaries improve when a critic pass checks them against a rubric for structure, completeness, and tone. The first draft gets the ideas down; the reflection pass makes them land.

Research and analysis. An agent that drafts a research summary can review it for unsupported claims and shaky sourcing, then revise. This is where reflection guards against the most expensive failure mode in enterprise AI: an answer that is confident, wrong, and cited.

Enterprise reality: A coding agent that writes a database migration, runs the test suite, reads the three failures, and ships the fixed version on its second pass is doing the work a senior engineer would trust. A coding agent that writes the migration and ships it unread is the one that takes down production on a Friday. The difference is one evaluation step and a loop, and that is the difference between a demo and something you let near a real system.

Some tasks should skip reflection entirely. A one-line classification, a simple lookup, or a format conversion does not need a critic. The first pass is already as good as the tenth, and the extra calls only add cost and delay. Reflection is for work where quality is hard-won and verifiable, not for work that is already easy.

The Cost and Latency Tradeoff

Every loop iteration is another round of model calls. A reflection loop that runs three passes costs roughly three times the tokens and three times the latency of a single generation, and the context usually grows at each step as the critique gets appended. For a user waiting on an answer, that adds up fast.

This is the central tension of the pattern. Reflection buys quality with time and money, and the exchange rate is not always worth it. A loop that turns an 80%-correct answer into a 95%-correct one is often worth three times the cost. A loop that turns a 94%-correct answer into a 95%-correct one rarely is. The art is knowing which task you are on.

The defenses are familiar from the rest of the series. Use a cheaper, faster model for the critique step when the task allows it, the same way routing and parallelization sends easy work to small models. Cap the iterations so the loop cannot spiral. And reserve the full loop for the high-stakes outputs where a wrong answer costs more than a few extra seconds, not for every call your system makes.

Stopping Criteria: Don't Loop Forever

The single most important design decision in a reflection loop is when to stop. An agent with no exit condition will revise until it runs out of budget, often making the work worse, not better, as it second-guesses good decisions. Three stopping rules, used together, keep the loop honest.

  • Quality threshold. Stop when the output clears an objective bar: all tests pass, the schema validates, the rubric score crosses a cutoff. This is the cleanest exit because it is tied to a real signal, not a guess.
  • Iteration cap. Stop after a fixed number of passes, often two or three, no matter what. This guarantees the loop terminates and bounds your worst-case cost and latency. It is the seatbelt for when the quality threshold never quite gets hit.
  • Diminishing returns. Stop when a revision barely changes the output, or when the critic stops finding substantive problems. If two consecutive passes produce near-identical results, the loop has converged and further iterations are wasted spend.

A well-built loop combines all three: revise until the quality bar is met, but never exceed the iteration cap, and bail early if the output stops improving. The goal is the smallest number of passes that gets the work over the line, not the largest number the budget allows.

When to Use Reflection (and When Not To)

Reach for reflection when the task has a quality bar that a first pass routinely misses.

  • Reach for it when correctness is verifiable: code with tests, data against a schema, math you can check, claims you can trace to sources.
  • It earns its keep on high-stakes outputs where a wrong answer is expensive: a migration, a legal summary, a financial calculation, anything a human would review before trusting.
  • The same goes for work where quality is hard-won: long-form writing, complex reasoning, multi-constraint problems where the first draft predictably falls short.

Other tasks should not pay for the loop at all.

  • Skip it for simple, single-shot tasks where the first answer is already reliable. A classification or a lookup does not need a critic.
  • Drop it when latency is critical and the quality gain is marginal. A loop that shaves a percent off the error rate is not worth tripling the response time.
  • And leave it out when you have no real way to evaluate the output. Reflection without a meaningful critique step is just three calls pretending to be quality control.

The honest test is whether the critique step has something true to check against. If it does, reflection turns a shaky first draft into trustworthy output. If it does not, you are paying for motion, not improvement. Once the loop is running in production, the next job is watching it. That is where monitoring and evaluation picks up, tracking whether the revisions improve quality over time or just burn budget.

Key Takeaways

  • Reflection is the self-correction pattern: an agent generates output, critiques it against criteria, and revises in a loop until the work clears a quality bar or hits a budget.
  • The loop is three moves and a gate: generate, critique, revise, then check whether to stop. The gate is the part teams most often forget.
  • Self-reflection (the model judges itself) is cheap but shares the model's blind spots; an external evaluator, especially a hard tool like a test suite, gives ground-truth feedback instead of another opinion.
  • It pays off most on code generation, math, complex writing, and research, the tasks with a quality bar a first pass routinely misses, and is wasted on simple single-shot work.
  • Always set stopping criteria by combining a quality threshold, an iteration cap, and a diminishing-returns check, so the loop terminates instead of polishing forever.

Is your site invisible to AI search?

Get a free AEO infrastructure audit and find out what your competitors are doing that you're not.

Get Your Free Audit
Quick answers

Frequently asked.