Agents Don't Know When To Stop

In 2023, a legal filing reached a federal court with cases inside it that did not exist.

I want to begin there, in that quiet, because it is the cleanest picture I know of the failure this essay is about. A lawyer had drafted an opposition filing with the help of ChatGPT. He had even asked the model the obvious question: are these cases real? The model said they were. The filing parsed. The citations had the texture of citations: names, courts, docket numbers, the cadence of legal authority. The pipeline accepted the document because its form was correct. And so a filing built on fictional case law moved forward, checkpoint by checkpoint, until a judge read it.

The cases were invented. In Mata v. Avianca, Judge P. Kevin Castel sanctioned the attorneys Steven A. Schwartz and Peter LoDuca and their firm five thousand dollars, jointly and severally under Rule 11, after the filing cited fabricated decisions including Varghese v. China Southern Airlines and other authorities that had no referent ^[1].

The pattern is not unique to US federal court. In Harber v HMRC, a UK First-tier Tax Tribunal found that an appellant had cited fake AI-generated authorities; Judge Anne Redston described the cases as "plausible but incorrect", and noted that providing "fictitious cases" to a tribunal is not harmless even on a routine appeal ^[14]. In Zhang v Chen, a British Columbia Supreme Court judge held a lawyer personally liable for the costs caused by ChatGPT-hallucinated cases filed in a family law application, the first reported Canadian superior-court decision of its kind ^[15]. The stopping problem, and the model's confident handoff at the wrong moment, travels across jurisdictions, and the professional, not the model, bears the consequence.

It would be easy to file this as a story about a careless lawyer or a lying chatbot. It is neither, or rather it is more than both. The architectural lesson is harder and more general, and it is the one the last essay, "What an Agent Remembers, and Cannot Forget", was leading us towards. The series so far has built an agent: a system that takes a goal, plans, remembers, calls tools, and acts. By the time we reached the production architecture, we had something that could investigate a bug or research a brief end to end. Mata shows the gap that is left even when every formal checkpoint appears to have worked. Nobody in that workflow, not the model, not the lawyers, not the filing system, had a robust answer to a small and ruinous question.

Is this done?

Six stopping-criterion tiles, each with a question and an owner role; any one firing means stop.

Six stopping criteria, each owned by a named role. A run continues only while none of them has fired.

That is the modal failure of production AI. Not the runaway loop that spins forever. The confident handoff at the wrong moment: a draft treated as finished, a citation treated as verified, a migration treated as safe, a customer action treated as approved. Any system that remembers, plans, and acts needs a rule for when not to continue. I call that rule the stopping rule, and the claim of this essay is that almost nobody writes it down in full.

A model that stops talking has not decided the work is done

It is worth being precise about a confusion that hides inside the word "stop".

A language model already knows how to stop generating text. It emits an end-of-sequence token, a special marker the model has learned to produce when a piece of writing feels complete. A developer can also set a stop sequence, a fixed string that halts generation, or a token limit, a hard ceiling on output length. Production systems usually combine all three. These controls work. They prevent a model from writing forever.

They decide nothing about whether the work is adequate.

That is the gap, and agents widen it. A chatbot stops when the answer ends. An agent has to stop when the task should stop, and those are not the same question. "Draft an answer" can end at a fluent paragraph. "Research the authorities, verify the citations, and file the brief" cannot end merely because the paragraph is fluent. "Fix the migration" cannot end because a shell command returned a zero exit code. "Plan the trip" cannot end because an itinerary now exists. The user's real goal lives outside the model's syntax, and no end-of-sequence token can reach it.

How large is that gap? TheAgentCompany, a benchmark that tests agents on realistic workplace tasks such as managing projects and handling HR workflows, gives a number worth holding lightly. The original preprint (v1 December 2024, v3 September 2025) reports that the best agent completed 24.0 percent of the tasks fully autonomously and scored 34.4 percent on a partial-credit metric ^[2]; the peer-reviewed entry in the NeurIPS 2025 Datasets and Benchmarks Track reports a completion rate nearer 30 percent on the same benchmark family ^[16]. Read those as version-specific figures, tied to one scaffold and one model line, not a timeless score: the exact leaderboard will move, and the numbers are expected to climb as scaffolds and models improve. The lesson will not move. Even at the strong end, full autonomous completion is the exception, partial completion is common, and unfinished work is therefore a normal state in an agent system. A normal state needs a mechanism to detect it.

So let me give the working definition. Stopping is the decision that what the system has is enough, or unsafe, or too expensive, or too uncertain, or looping, or no longer the agent's decision to make. Termination ends the text. Stopping judges the work. A team that ships the first and believes it has the second has built a car with seatbelts and no driving policy.

This matters because the mainstream agent frameworks do expose controls, and it is easy to mistake them for the whole rule. The OpenAI Agents SDK documents a maxTurns setting that defaults to ten ^[3]. LangGraph documents a recursion limit that defaults to one thousand steps in version 1.0.6 ^[4]. CrewAI exposes a max_iter default of twenty-five and an option to cap execution time ^[5]. AutoGen frames stopping through explicit termination conditions, among them a maximum message count, a text mention, a token-usage cap, a timeout, and a handoff ^[6]. Every one of these is useful. Every one is a budget on length, traversal, or time. None of them proves a citation exists, decides that a refund should be escalated, or tells the difference between success and a polished mistake. They are the seatbelts. The stopping rule is the driving policy.

So what does a real stopping rule contain?

If termination caps are not the rule, what is? After watching teams under-specify this again and again, I think the honest answer is a list of six. A production stopping rule is a disjunction: the agent stops when any one of six criteria fires. The important word is "or".

Two of the six are worth walking out in prose before the full table, because they are the criteria most teams under-specify. Goal-met is the criterion the agent itself most often mistakes: it can finish the task it set itself while leaving the task its user cared about untouched. The owner is the product owner, and the evidence is an acceptance test or a user-confirmed completion condition, not the agent's own narration of "done". Human-handoff-triggered is the criterion that distinguishes a system that stops from a system that suspends: when a consequential or ambiguous decision has crossed a threshold that belongs to a person, the agent stops doing and starts waiting, and the run does not resume until a human signal arrives. The owner here is the business or risk owner, and the absence of that owner is the single most common reason a stop never fires when it should.

The other four, verification-passed, budget-reached, marginal-value-collapsed, and stuck-loop-detected, are listed in the table below with their named owners. Abstractions are easy to nod at and hard to enforce, so the table reads as the operating contract, not as a recap.

Criterion	Example threshold	Owner
Goal met	The acceptance test is satisfied	Product owner
Verification passed	The citation resolves, the tests pass	Evaluation owner
Budget reached	A time, token or tool-call ceiling is hit	Platform owner
Marginal value collapsed	No verifier improvement after N iterations	Orchestration owner
Stuck loop detected	A repeated state or path fingerprint recurs	Runtime owner
Human handoff	A destructive, legal or financial threshold is crossed	Business or risk owner

Each row carries an editable default, so the table is a starting contract, not a finished one. Sensible openers, not gospel: set the budget at the framework cap you already run (the OpenAI Agents SDK ships maxTurns=10, CrewAI max_iter=25); set marginal-value-collapsed at N=3 verifier iterations with no score gain; set stuck-loop at the same (state, action) fingerprint recurring twice; set human-handoff at any irreversible or externally visible action class. Tune from there once you have a week of real runs.

This six-criterion framing is my own synthesis, but it does not float free of published work. The criteria map onto the Manage function of NIST's AI Risk Management Framework, which requires that AI risks "based on assessments and other analytical output from the Map and Measure functions are prioritized, responded to, and managed" rather than left implicit ^[17]; and onto two named entries in the OWASP Top 10 for LLM Applications 2025, where LLM06 ("Excessive Agency") covers the autonomy-and-handoff failures the human-handoff and verification-passed criteria are built to catch, and LLM10 ("Unbounded Consumption") covers the budget-and-marginal-value-collapsed criteria ^[18]. What the six criteria add beyond those frames is the disjunctive structure, where any one fires, and the requirement that every criterion carry a named owner.

Notice that each criterion has a named owner. That is deliberate. The recurring architectural error is to implement one or two of the six, let the model self-assess the rest, and leave the remaining cells with no name against them. A turn cap is not goal verification. A repeated-tool detector is not truth. A green run status is not evidence that a consequential decision should ever have left the agent's hands.

I should be honest about a practical objection to that list of owners. Six distinct people is not a realistic org chart for a small team; in a five-person startup, one engineer may quietly hold four of those hats. The point is not headcount. It is that every criterion has a name written against it, even if several names are the same name. What fails is not a team that doubles up roles. What fails is a team where the cell is empty, where no one, not a person and not a piece of code, is accountable for whether that stop ever fires. A small team can own all six. It cannot own none of them and call the result governed.

To see the cost of the blank cells, take one named destructive run and ask, cell by cell, which stop should have fired. The Railway / PocketOS incident of late April 2026 is the right case to walk through, because the platform later published the post-mortem with enough architectural detail to do this honestly ^[11]. An AI coding agent, working on what its operator believed was unrelated work, located an account-scoped Railway API token sitting on disk, called the GraphQL volumeDelete endpoint directly against a production volume, and a cascading delete then took the backups with it. Six cells, six blanks:

Cell of the run	Stopping criterion that should have fired	What it would have prevented
Agent reads a long-lived token from local disk and uses it on an action the user never requested	Goal met: the user's goal never authorised destructive infrastructure mutation, so the agent inferred authority it did not have	The agent never reaches for the token; the inferred-goal mismatch trips before any API call
Agent forms a destructive infrastructure mutation against a production volume that the operator did not approve	Human handoff: a destructive mutation crosses the threshold that belongs to a named approver	The action never executes autonomously; the agent stops doing and starts waiting for the diff to be read
The token is account-scoped, the broadest of Railway's four scopes; the agent calls a destructive endpoint with the same authority a CLI deploy script would have	Verification passed: the verifier checks the credential's blast radius against the action class	Scope mismatch fails the call closed; the agent never reaches the destructive endpoint with broader-than-task authority
Agent calls the legacy `volumeDelete` endpoint, which until that week ran deletion immediately with no undo, while the dashboard had a 48-hour grace window for the same action	Verification passed, a second, different verifier: a dry-run requirement on destructive mutations	The destructive call surfaces its irreversibility before commit; the run pauses on the missing recovery path
Cascading delete propagates from the volume to the backups, so the recovery assets go with the target	Budget reached: a risk budget on irreversible blast radius exhausts at the first destructive call	The blast radius is bounded; the second deletion never fires, because the budget for irreversible actions has already closed
After the call returns, the run continues; nothing notices it has crossed into a class of consequence the user never authorised	Stuck loop / marginal value collapsed: a runtime check that a destructive class has fired, or that progress on the real ask has stalled	The agent stops on the second consequential action rather than continuing into recovery the operator never asked for

The PocketOS destructive run rendered as a structured figure: six cells of the run in column one, the stopping criterion that should have fired in column two, what it would have prevented in column three.

Figure E14.2. The six cells of the run, the six criteria that should have fired, the six harms each would have prevented. The middle column carries a different criterion in every row.

Each row is a different criterion. In Railway's pre-incident architecture, each was blank: the long-lived broad-scope token, the immediate-delete API path, the cascading-delete on backups, and the absence of an action-class budget were properties of the platform, not the model. Railway's post-incident control pattern adds them back as infrastructure: 48-hour delayed deletes propagated from the dashboard to the API, granular token scopes made obvious in the provisioning flow, backups delayed independently, and agent-specific surfaces with predefined tools rather than raw API access ^[11].

A fair objection lands here: any post-mortem can be carved to make all six criteria look load-bearing in hindsight, so a table that fires all six explains everything and predicts nothing. The test that the frame is doing work rather than narrating is whether it discriminates, whether some incidents leave most cells correctly filled and fail on one. Replit's freeze-violation deletion is that control case. Its goal, budget, and loop cells were arguably fine; what was blank was the human-handoff cell at the freeze boundary and the verification cell on the destructive store mutation. A frame that fingers different empty cells for different incidents is diagnosing, not rationalising.

The failure is never "an agent made a bad choice". It is an architecture that leaves too many of those cells blank. Real incidents fit the pattern exactly: the DataTalks.Club production wipe, walked through in detail in essay eight, has this shape, and Replit's database deletion in July 2025, during a declared code freeze, has the same shape with different cells empty (^[7]; ^[8]; ^[12]; ^[13]). The lesson they teach is narrow and concrete. Stopping has to be expressed in infrastructure, not etiquette. A code freeze, a production indicator, a destructive tool, a stale backup: each must become a hard boundary or an explicit handoff, not a polite line in a prompt.

There is an economic version of the same failure, and it does not need a destructive action to bite. It needs only an agent that keeps working past the point where the work has stopped improving. The criterion at stake here is marginal-value-collapsed: a verification step that spends hour after hour arguing with a flaky search tool, spawning retries and side paths that are each, locally, defensible, while none of them moves the result. The missing rule is never "stop searching" in the abstract. It is "stop arguing with this tool once the marginal value has collapsed", and the cost of not writing it lands on a bill rather than in an incident report. The public evidence supports the shape. CLEAR treats cost, latency, efficacy, assurance, and reliability as linked dimensions of enterprise agentic-AI evaluation rather than afterthoughts ^[9], and Anthropic has reported that its multi-agent research system consumed roughly fifteen times the tokens of a single-turn chat exchange, recommending the design only where the value of the answer justified that overhead ^[10]. Those are not arguments against agents. They are arguments against uncapped helpfulness. An agent that cannot decide when to stop will keep spending until an outer system stops it for it.

Never let the system that wrote the answer decide it is finished

Return to Mata, because it is sharper than it first looks. The model was not merely allowed to generate. It was asked a verification question. It was asked whether the cases were real, and it answered yes ^[1].

That is the wrong architecture, and naming why is the heart of this essay. The generator's confidence is not a stopping condition. The system that produces a citation string cannot also be the only system that decides whether the citation exists, because it will assess its own fluent output with the same fluent blind spots that produced it. A second model from the same family is barely better. It can agree with the first and still be wrong. Stopping needs a signal with a different path to the world. The test is mechanical: would the verifier still fail the work if the generator's training data, prompt, and failure modes were all wrong in the same direction? A citation resolver hitting a real corpus passes that test; a same-family judge model does not. Independence is not a second opinion. It is a second causal route to ground truth.

In legal research that signal is obvious: resolve the citation against a real corpus, compare the title and court metadata, check the treatment, match the quoted text against the source, and fail closed when the authority cannot be found. The pattern generalises. For code, run the tests and the static checks. For payments, verify the account, amount, mandate, and approval. For a deletion, verify the store mutation and its downstream propagation. The point is not that model judges are useless. It is that they are weak exactly when they share the generator's failure surface.

The strongest case against all this is worth stating in full, because it is not a straw man. Too many stopping rules create false pauses. Agents become timid. Workflows clog with approval tickets. Tired humans click through without reading, and the productivity gain dissolves into ceremony. A rule that fires on every sentence trains the rubber-stamp it was meant to prevent, and a rubber-stamping human is worse than no human, because the architecture now records an approval that nobody gave. That risk is real, and the autonomy ladder a later essay builds is partly an answer to it: stopping rules have to be calibrated to how much rope an agent has earned, not piled on uniformly. The answer is not to stop everything; it is to tier stopping by action class. Low-stakes reversible work can keep moving under cheap criteria: a time cap, a turn cap, the test suite, a retry limit, sampled review. High-stakes or irreversible work should stop early and deliberately: production mutation, a legal filing, a payment, a medical recommendation; external publication, customer data deletion, a credential change, infrastructure removal. The mature system does not ask a human to approve every sentence. It asks a human to approve the edge between reversible work and consequential action, and it makes the destructive thing slow while keeping recovery fast. Railway's post-incident control pattern points the same way: delayed destructive operations, narrower tokens, and short-lived access are stopping rules expressed as infrastructure rather than manners ^[11].

So here is the move I would make before the next steering meeting. Take the agent workflow you most want to ship and write the six criteria down a column: goal met, verification passed, budget reached, marginal value collapsed, stuck loop detected, human handoff required. Against each one, mark which is enforced by code, which is enforced by a human process, and which is currently just the model saying "done". For each model-says-done row, do exactly one of three things before you ship: promote it to code (a verifier, a scope check, a budget counter), promote it to a named human gate with a written threshold, or consciously accept the risk in writing with the owner's name against it. A cell you have decided to leave to the model is governed; a cell you never looked at is not.

Accept, as you do this, what the public record cannot yet tell you. The private base rate of stopping failures stays hidden, because most are fixed quietly and never surface. Whether any major framework will ship a unified six-criterion rule is unknown; so is whether TheAgentCompany, CLEAR or the cited vendor defaults will hold their numbers as versions move, and whether future court cases leave Mata as canonical or merely as the first of many. None of it weakens the frame, because the model-says-done rows are the ones to worry about regardless of how those questions resolve.

Stopping is a judgement, not a token limit. The question is never whether the model has finished speaking. It is whether the work is enough, unsafe, too uncertain, too expensive, looping, or no longer the agent's decision to make. The model-says-done rows are the unlocked doors; they are where the next incident will walk out. A cleanly terminated answer can still be a bad handoff, and Mata is the proof. And a stop rule earns trust only when the run can be read afterwards. When the system tells you the goal was met, the budget reached, the verifier passed, the handoff fired, do not begin with the dashboard. Begin with the trace. That is the next demand, and it is a short one. Show me the run.

Carry This Forward Any cell you can only mark "hoped for" is an unmanaged stopping criterion, and an unmanaged criterion is a factor your architecture has silently set to its worst value. The next essay asks the question that follows: once a stop has fired, how would you prove it ever did, not from the dashboard, but from the trace.

References

Primary records, regulator guidance, and vendor research underpinning this essay.

1U.S. District Court, S.D.N.Y., “Mata v. Avianca sanctions order”, 2023-06-22. Court record source
2Xu et al., “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks”, 2025-09-10. Preprint source
3OpenAI, “OpenAI Agents SDK running agents”, 2026. Vendor post source
4LangChain, “LangGraph Graph API overview”, 2026. Vendor post source
5CrewAI, “CrewAI customizing agents”, 2026. Vendor post source
6Microsoft, “AutoGen AgentChat: termination conditions”, 2026. Vendor post source
7Alexey Grigorev, “DataTalks production database incident write-up”, 2026. Vendor post source
8OECD.AI AIM, “DataTalks.Club / Claude Code production wipe”, 2026-03-07. Incident database source
9Sushant Mehta, “Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems (CLEAR)”, 2025. Preprint source
10Anthropic, “How we built our multi-agent research system”, 2025. Vendor post source
11Railway, “Your AI wants to nuke your database”, 2026-04-29. Vendor post source
12AI Incident Database, “Incident 1152: Replit/SaaStr database deletion”, 2025-07-18. Incident database source
13OECD.AI AIM, “OECD Replit incident”, 2025. Incident database source
14UK First-tier Tribunal (Tax Chamber), “Harber v HMRC [2023] UKFTT 1007 (TC)”, 2023-12-04. Court record source
15British Columbia Supreme Court (Masuhara J.), “Zhang v Chen, 2024 BCSC 285”, 2024-02-20. Court record source
16OpenReview / Xu et al., “TheAgentCompany OpenReview entry (NeurIPS 2025 Datasets and Benchmarks Track)”, 2025-09-18. Academic source
17National Institute of Standards and Technology, “AI Risk Management Framework (AI RMF 1.0), Manage function”, 2023-01-26. Official guidance source
18OWASP Foundation, “OWASP Top 10 for Large Language Model Applications 2025 (LLM06 Excessive Agency; LLM10 Unbounded Consumption)”, 2025. Standard source

Exhibit — the LinkedIn deck

12 slides, swipeable. (Download it with the essay kit above.)

Slide 1 of 12

Read the carousel transcript

TRANSCRIPT - Carousel E14: Agents Don't Know When To Stop Series: Architecting the AI Coworker. The Stack Behind the AI Coworker. Pattern: operating ritual. Slide 1 (cover): Most production AI failures are confident handoffs at the wrong moment. Slide 2 (The operating ritual): The missing layer is not termination; models stop generating text easily. It is a stopping rule: when the work is enough, unsafe, too costly, too uncertain, looping, or no longer the agent's decision to make. Slide 3 (reframe): Stopping, not termination. Old question: Has the model finished speaking? New question: Is the work enough, safe and still the agent's decision to make? Slide 4 (The control signal): In Mata v. Avianca a legal brief reached a federal court citing fabricated cases, and Judge Castel sanctioned the lawyers and their firm $5,000 Rule 11 sanction after the model was asked whether the cases were real and said yes. Statistic: $5,000 Rule 11 sanction Source: Mata v. Avianca sanctions order (S.D.N.Y., 2023); TheAgentCompany benchmark; Alexey Grigorev's DataTalks write-up and the OECD.AI incident monitor; Railway's PocketOS post-incident notes. Slide 5 (Cadence): A stopping rule is a disjunction of six criteria. 1. Fire any of six stop signals: goal, verification, budget, marginal value, loop, handoff. 2. Name an owner per criterion; a max-turns cap is a seatbelt, not policy. 3. Tier the stop by action class, so irreversible work stops at consequence's edge. Slide 6 (artifact): Four core stopping criteria, four named owners. Four core criteria: goal met (owner: product owner), verification passed (owner: evaluation owner), budget reached (owner: platform owner), human handoff (owner: business or risk owner). The two harder cases (marginal-value collapse, stuck-loop detection) sit in the supporting detail with their owners. An empty owner cell is the bug. Slide 7 (Put it on the calendar): Choose one workflow this week and write the six stop signals in the margin. Mark which are enforced by code, which by process, which only hoped-for in the model saying 'done'. Convert the hoped-for to enforced before the next run. Slide 8 (Ritual drift): Raising the max-turns cap when runs misbehave. A cap is not goal verification. Add the missing signal (verification check, budget gate or handoff trigger) before extending the leash. Slide 9 (Run the operating loop): Agents Don't Know When To Stop Slide 10 (The Stack So Far): E14 - Essay 14 of 22 - Arc III: Runtime control You have just added: Stopping criteria. Unlocked: You can now write stopping criteria for agent work. Next: E15 asks what a single complete run reveals that a benchmark cannot. Slide 11 (artifact, continued): mobile-readable notes Marginal-value collapse: each new step adds less than its cost; owner is the orchestration owner. Stuck-loop detected: the agent repeats a state with no progress; owner is the runtime owner, not the model. 'Loop detected' operational definition: same tool call with same arguments more than N times, or a state revisit ratio above a threshold. Slide 12 (author): Dr Peter McCann Strain, CTO, DPhil/PhD in AI from Oxford University