The loop isn't the moat. The verifier is.

2026-06-11 · 12 min read

An agent is a loop. The model takes an action, something checks the result, it goes again, and it repeats until it hits the goal or you cut it off. That is the whole shape. Strip the branding off anything sold as an "agent" this year and you find some version of that loop underneath.

We build these loops for a living at Social Neuron, so weigh this accordingly. But here is the part nobody selling you an "agent framework" says out loud: the loop is about eight lines of code. You can write it tonight, so can your competitor, and it is not where the value is. The value is one line inside it: the line that checks the work.
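
To make "about eight lines" concrete, here is a minimal sketch of the shape. Every name in it (runLoop, act, verify, maxSteps) is made up for illustration rather than taken from any framework, and the "model" is a stub that simply counts its attempts.

```typescript
// Hypothetical names throughout: this is not any framework's API.
type LoopVerdict = { ok: boolean; feedback: string };

function runLoop<T>(
  act: (feedback: string | null) => T,      // the model call, stubbed below
  verify: (candidate: T) => LoopVerdict,    // the one line that checks the work
  maxSteps: number                          // where you cut it off
): T | null {
  let feedback: string | null = null;
  for (let step = 0; step < maxSteps; step++) {
    const candidate = act(feedback);        // take an action
    const verdict = verify(candidate);      // something checks the result
    if (verdict.ok) return candidate;       // goal reached
    feedback = verdict.feedback;            // go again, carrying the "no" forward
  }
  return null;                              // cut off without a verified result
}

// Stub "model" that only gets it right on the third attempt.
let attempts = 0;
const loopResult = runLoop<number>(
  () => ++attempts,
  (n) => ({ ok: n === 3, feedback: `got ${n}, expected 3` }),
  5
);
```

Everything interesting lives in what you pass as verify. Swap the stub for a real test run and the rest of the function does not change.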

That one line is the whole reason half the people you follow think agents 10x'd them while the other half think they are a scam. Both are telling the truth. Point a loop at a job with a real check and it quietly ships. Point the same loop, same model, at a job with no check and it burns a weekend and a few hundred dollars producing confident garbage. The technology is identical. The line is not.

The boring definition, because it matters

That definition is not mine. "An LLM that calls tools in a loop until it hits a goal" is Anthropic's own framing in Building Effective Agents, and it is deliberately unglamorous. One move, look at the result, pick the next move. A commenter on a widely-shared Hacker News thread put the uncomfortable version of it: the loop is trivial, and most of the "magic" is the underlying model and its tool-calling, not the few lines wrapping it. He is right. That is exactly why everyone can build one, and exactly why the loop cannot be the thing you defend.

And yet the loop is worth more than almost any model upgrade. Andrew Ng showed this back in the GPT-3.5 era: GPT-3.5 wrapped in an agent loop reached about 95% on the HumanEval coding benchmark, while GPT-4 answering in one shot managed 67%. The loop was worth more than a full model generation. Hold both facts at once: the loop is trivial to write, and it is the biggest lever you have. That tension is the whole subject of this post.

Why the same few lines do both jobs

There is real math under the failure case. If one step succeeds with probability p and the steps are independent, a chain of n steps that all have to land succeeds with probability p^n. At 95% per step: five steps land 77% of the time, twenty steps 36%, fifty steps 8%. A 95%-reliable step is a strong B+ student. String twenty together with no correction and the agent finishes the whole task right about a third of the time. That is the math behind every "I left it running and it lost the plot" thread. Errors don't average out over a long run. They compound. That is an open loop: act, never check, carry every mistake forward.
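
Written out, assuming each step succeeds independently with the same probability p, the open-loop arithmetic is:

```latex
P(\text{all } n \text{ steps succeed}) = p^{n},
\qquad
0.95^{5} \approx 0.774, \quad
0.95^{20} \approx 0.358, \quad
0.95^{50} \approx 0.077
```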

A loop with a real check behaves the opposite way. It corrects. The check tells it the truth on every pass and it retries when it is wrong, so a 95%-reliable step that gets three independent tries only fails when it whiffs three times in a row, which is about once in 8,000. The same arithmetic that destroys an open loop quietly rescues a closed one. This is not new. It is the feedback principle control theory has used for a century to land planes and hold a car at 70 mph: measure the gap between where you are and where you should be, correct, repeat.
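
The two regimes can be put side by side in a few lines. This is a sketch under idealized assumptions that the prose only implies: tries are independent, and the verifier is perfect, meaning it always catches a failed try. The function names are ours.

```typescript
// Probability that a chain of n steps all succeed with no correction.
function openLoopSuccess(p: number, n: number): number {
  return Math.pow(p, n);
}

// Probability that one step succeeds within k independent tries,
// assuming a perfect verifier that always detects a failed try.
function stepWithRetries(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// Chain of n steps where each step gets up to k verified tries.
function closedLoopSuccess(p: number, n: number, k: number): number {
  return Math.pow(stepWithRetries(p, k), n);
}

const open20 = openLoopSuccess(0.95, 20);        // about 0.358
const stepFail = 1 - stepWithRetries(0.95, 3);   // about 0.000125, i.e. 1 in 8,000
const closed20 = closedLoopSuccess(0.95, 20, 3); // about 0.9975
```

Real verifiers miss things and real retries are correlated, so treat the closed-loop figure as an upper bound rather than a forecast.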

So the question that decides which camp you land in isn't "which model" or "which framework." It is narrower: can your loop tell, without asking the model that just did the work, whether the work is right? A test. A type check. A schema. A compiler. Something that can say "no" in a language the model can't argue with.

> The loop is the cheap part. The verifier is the moat, and so is the ratchet behind it: every failure you hit becomes a hook, a rule, or a test, so the system never makes that mistake twice.

The trap that turns a "working" loop back into an open one

Here is where most loops that look closed are quietly open. If your check calls the same model that just produced the work, you don't have a verifier. You have a second opinion from the source of the first one, and models skew generous when they grade themselves. Anthropic's harness guidance and Addy Osmani's writing on harness engineering land on the same rule from different directions: keep the thing that makes the work separate from the thing that judges it.

There are numbers behind this now, and they are two separate findings worth keeping straight. Lance Martin, who works on agents at Anthropic, ran a loop on Parameter Golf, where an agent edits training code, launches a run, reads the score, and decides what to try next. First finding: a verifier sub-agent grading in its own context beats a model critiquing itself. It caught real issues the self-critique waved through, because a model reading its own reasoning rationalizes mediocre work while a cold grader sees only the result and the standard. Second finding: wrap the newer model in that loop and the generation gap widens. Claude Fable 5 improved the training pipeline roughly six times more than Opus 4.7, the model before it. So the separate grader is what keeps a loop honest, and a better model is what makes an honest loop compound. It is also why the newest primitives are built this way: /goal in Claude Code and Outcomes in Claude Managed Agents both spin up a separate grader with its own context window instead of letting the worker mark its own homework.

The practical test is blunt. npm test, tsc, a compiler, a schema validator, a human signing off: real verifiers, because the answer does not come from the model. "Does this look good?" asked to the model that wrote it, or a model scoring open-ended prose against a soft rubric: not verifiers, however much they feel like one.
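
One way to enforce that separation in code is to make it structurally impossible for the grader to read the worker's reasoning. The sketch below is illustrative only: WorkerOutput, grade, and the test cases are hypothetical, and the "work" is a toy add function.

```typescript
// Hypothetical shapes for illustration only.
interface WorkerOutput {
  artifact: (a: number, b: number) => number; // the work product: a candidate add()
  reasoning: string;                          // the worker's own account of why it is right
}

interface GradeResult {
  pass: boolean;
  failures: string[];
}

// The fixed standard. It does not come from the model.
const ADD_CASES: Array<{ args: [number, number]; expected: number }> = [
  { args: [2, 3], expected: 5 },
  { args: [0, 0], expected: 0 },
  { args: [-1, 1], expected: 0 },
];

// The grader receives only the artifact, never the reasoning.
function grade(artifact: WorkerOutput["artifact"]): GradeResult {
  const failures: string[] = [];
  for (const c of ADD_CASES) {
    const got = artifact(...c.args);
    if (got !== c.expected) {
      failures.push(`add(${c.args.join(", ")}) = ${got}, expected ${c.expected}`);
    }
  }
  return { pass: failures.length === 0, failures };
}

const confidentButWrong: WorkerOutput = {
  artifact: (a, b) => a * b,
  reasoning: "Verified carefully; all edge cases handled.",
};
const gradeWrong = grade(confidentButWrong.artifact);
const gradeRight = grade((a, b) => a + b);
```

The reasoning string never reaches grade, so however persuasive it is, it cannot move the verdict.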

Where loops actually work right now

Run that test across the tasks people throw at agents and the map falls out on its own. The jobs that work are the ones where the world tells the agent it is wrong, fast and unambiguously:

Coding against a test suite. The verifier is the tests passing or failing. Cheap when green; cost spikes only on a genuinely hard bug.
Log triage and reconciliation. The verifier is reconciling to a known total. Low, bounded, predictable.
Data extraction. The verifier is a schema. Cheap; reject and retry the bad rows.
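
For the extraction case, "the verifier is a schema" can be as small as a hand-written check that splits rows into accepted and retry. The InvoiceRow shape and the currency list below are invented for the example. A real pipeline would more likely lean on a schema library, which this sketch deliberately avoids so it stays dependency-free.

```typescript
// Hypothetical row shape for an invoice-extraction task.
interface InvoiceRow {
  id: string;
  amountCents: number;
  currency: string;
}

const CURRENCIES = ["USD", "EUR", "GBP"];

// Returns null when the row matches the schema, otherwise the reason it was rejected.
function rowError(raw: unknown): string | null {
  if (typeof raw !== "object" || raw === null) return "not an object";
  const r = raw as Record<string, unknown>;
  const id = r["id"];
  const amount = r["amountCents"];
  const currency = r["currency"];
  if (typeof id !== "string" || id.length === 0) return "id must be a non-empty string";
  if (typeof amount !== "number" || !isFinite(amount) || Math.floor(amount) !== amount || amount < 0) {
    return "amountCents must be a non-negative integer";
  }
  if (typeof currency !== "string" || CURRENCIES.indexOf(currency) === -1) {
    return "currency must be one of " + CURRENCIES.join(", ");
  }
  return null;
}

// Accept the rows that pass; queue the rest for another extraction attempt.
function splitRows(rows: unknown[]) {
  const accepted: InvoiceRow[] = [];
  const retry: Array<{ index: number; reason: string }> = [];
  rows.forEach((raw, index) => {
    const reason = rowError(raw);
    if (reason === null) accepted.push(raw as InvoiceRow);
    else retry.push({ index, reason });
  });
  return { accepted, retry };
}

const extracted: unknown[] = [
  { id: "inv-1", amountCents: 1999, currency: "USD" },
  { id: "inv-2", amountCents: "19.99", currency: "USD" }, // amount came back as a string
  { id: "", amountCents: 500, currency: "EUR" },          // id is empty
];
const { accepted, retry } = splitRows(extracted);
```

The rejection reason doubles as the feedback you hand back to the model on the retry.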

And the jobs people quit on are the ones with no ground truth and a meter running: "run our whole strategy" (no verifier, unbounded cost) and open-ended research (no signal that says "done"). The mistake is never "using an agent." It's pointing one at the second list and expecting the behaviour of the first.

A check tells you if you won. It doesn't stop you burning money to find out.

This is the part the "do agents work" debate skips, and the part the people still running loops actually argue about. Teams rarely die on the unit price of a token. They die on a retry loop that quietly burns five to ten times the expected budget before anyone looks. A stopping rule isn't one setting. It is three limits and one measurement:

A spend ceiling per task. A hard stop on tokens, priced in dollars, set low and unforgiving. The loop dies at the ceiling whether or not it was "nearly there," because "nearly there" is what a stuck agent always reports.
A retry cap per step. Three attempts on the same failing step, then escalate. No infinite retries of a patch that already failed twice.
A circuit breaker on repeats. The same error twice in a row means the model is stuck, not progressing. Halt and hand it to a human.
Log cost per green check, not cost per run. The number that matters is cost per success.

In practice a clean pass costs cents and a spiral costs real money, and the circuit breaker exists entirely because of the spiral. Cost per run averages the two together and hides the gap. Cost per success is the number that exposes it.
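
The four rules above fit in one small guard object. This is a sketch, not what runs in our production loop: LoopGuard, its method names, and every limit and dollar figure are placeholders chosen for the example.

```typescript
// All limits and dollar amounts below are hypothetical placeholders.
interface Limits {
  maxCostUsd: number;        // hard ceiling per task
  maxRetriesPerStep: number; // failed attempts on one step before escalating
  maxRepeatedErrors: number; // identical errors in a row before halting
}

type Halt = "budget_ceiling" | "retry_cap" | "circuit_breaker";

class LoopGuard {
  private limits: Limits;
  private spentUsd = 0;
  private retries = 0;
  private lastError: string | null = null;
  private repeats = 0;
  private greenChecks = 0;

  constructor(limits: Limits) {
    this.limits = limits;
  }

  // Record one attempt. `error` is null when the verifier passed.
  // Returns a halt reason, or null if the loop may continue.
  record(costUsd: number, error: string | null): Halt | null {
    this.spentUsd += costUsd;
    if (error === null) {
      this.greenChecks++;
      this.retries = 0;
      this.repeats = 0;
      this.lastError = null;
    } else {
      this.retries++;
      this.repeats = error === this.lastError ? this.repeats + 1 : 1;
      this.lastError = error;
    }
    if (this.spentUsd >= this.limits.maxCostUsd) return "budget_ceiling";
    if (this.repeats >= this.limits.maxRepeatedErrors) return "circuit_breaker";
    if (this.retries >= this.limits.maxRetriesPerStep) return "retry_cap";
    return null;
  }

  // Cost per green check, or null if nothing has passed yet.
  costPerSuccess(): number | null {
    return this.greenChecks === 0 ? null : this.spentUsd / this.greenChecks;
  }
}

const guard = new LoopGuard({ maxCostUsd: 1.0, maxRetriesPerStep: 3, maxRepeatedErrors: 2 });
const halts = [
  guard.record(0.02, null),                        // clean pass
  guard.record(0.05, "TypeError: x is undefined"), // first failure
  guard.record(0.05, "TypeError: x is undefined"), // same error again: stuck
];
```

The budget check runs first on purpose, so a loop that passes its check on the attempt that crosses the ceiling still stops.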

This is what people mean when they say the harness, not the model, is the bottleneck. Addy Osmani puts it flatly: a decent model with a great harness beats a great model with a bad harness. Teams have moved an agent from thirtieth to fifth on a public coding benchmark by changing the harness alone, same weights. Anthropic found that token usage alone explained about 80% of the performance variance in their research agents, more than the model or the tools did. The thing you actually tune is the loop around the model, not the model.

The long run dies a quieter death: context rot

"Watch your tokens" is too vague to act on. The specific failure is re-feeding the entire history into every pass, so each turn gets longer, slower, and a little dumber than the last. It has a name now, multi-turn degradation, and a Microsoft and Salesforce paper measured it: a 39% average drop in performance as conversations stretched, and once a model takes a wrong turn early it rarely recovers. The fixes are structural: summarize or reset the context between passes instead of appending forever; keep the plan in a file on disk and let each short session read it, do one thing, and write back; treat a long overnight run as many short fresh-context sessions sharing a file, not one ever-growing conversation.

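The "many short sessions sharing a file" pattern described above can be sketched as follows. To keep the example self-contained, an in-memory object stands in for the plan file on disk, and doTask stands in for a fresh-context model call. The Plan shape and all function names are assumptions made for this sketch.

```typescript
// Hypothetical plan format. An in-memory object stands in for the file on disk.
interface Plan {
  todo: string[];
  done: string[];
  notes: string; // short rolling summary, overwritten each session
}

// `doTask` stands in for a fresh-context model call: it sees one task and the
// notes, never the transcript of earlier sessions. It returns the new summary.
type DoTask = (task: string, notes: string) => string;

// One short session: read the plan, do one thing, write back.
function runSession(plan: Plan, doTask: DoTask): Plan {
  const task = plan.todo[0];
  if (task === undefined) return plan; // nothing left to do
  const notes = doTask(task, plan.notes);
  return { todo: plan.todo.slice(1), done: plan.done.concat(task), notes };
}

function runSessions(initial: Plan, doTask: DoTask, maxSessions: number): Plan {
  let plan = initial;
  for (let i = 0; i < maxSessions && plan.todo.length > 0; i++) {
    plan = runSession(plan, doTask); // each pass starts from the plan, not the history
  }
  return plan;
}

// Stub worker that records how much context each session was handed.
const contextSizes: number[] = [];
const finalPlan = runSessions(
  { todo: ["fix build", "update docs", "tag release"], done: [], notes: "" },
  (task, notes) => {
    contextSizes.push(task.length + notes.length);
    return "last finished: " + task;
  },
  10
);
```

Because notes is overwritten rather than appended, the context handed to each session stays roughly constant no matter how long the run goes.
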
Why the people who build these stopped writing prompts

> This morning I was managing maybe a few hundred [agents]. Some days it's thousands, or tens of thousands.

That is Boris Cherny, the creator of Claude Code, speaking at Fortune's Brainstorm Tech in June 2026. He says he hasn't hand-written a line of code in about eight months; another Claude does the prompting now. His loop babysits open pull requests: it watches the queue, fixes failing builds, reads new review comments, hands each one to an isolated worktree agent, and repeats until everything is green. The build passing is the verifier, baked in. Don't read "tens of thousands" as a flex you can't touch. Translate it: it means orchestrated sub-agents, each with its own check, so no single one can quietly go wrong. The loop you can run tonight is the same shape at a scale of one.

Andrej Karpathy made the same bet from the research side. His AutoResearch agent ran 700 training experiments in two days and surfaced 20 optimizations with no human in the loop. His warning is the line worth keeping: the value lives in the eval harness and the reward signal, not the agent loop wrapping them. The loop is plumbing. The check is the product.

The second loop is memory, and it usually backfires

Everything above is the inner loop: one task, one sitting. Wrap a second loop around it and the agent carries what it learned from one run into the next. It sounds obviously good. It usually isn't. Continual Learning Bench, a June 2026 benchmark from Asawa and colleagues at Berkeley and Snorkel, tested exactly this and found the opposite for most systems: plain in-context learning beat dedicated memory modules. Bolting on memory tended to make agents worse, because they hoarded stale, unchecked beliefs and overfit to whatever they saw last.

So the memory that helps is the verified kind. The reported edge for Fable 5 here comes from a discipline, not a bigger notebook: fail, investigate why, verify the diagnosis into a checked fact, distill it into a rule, then consult the rule next time instead of re-deriving it. Weak setups stop at "write down what happened." Strong ones only save what they checked. An unverified memory is just a confident guess you will trust later, so a memory loop with no verifier rots the same way an open inner loop does.
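
"Only save what they checked" translates into a write path that refuses unverified entries. The sketch below is one possible shape, not how any particular memory system works: VerifiedMemory, Lesson, and the reproduced-failure check are all hypothetical.

```typescript
interface Lesson {
  rule: string;     // the distilled rule to consult next time
  evidence: string; // what the diagnosis rests on (here: a failing test id)
}

// The check must not be the model that proposed the lesson.
type LessonCheck = (lesson: Lesson) => boolean;

class VerifiedMemory {
  private lessons: Lesson[] = [];
  private check: LessonCheck;

  constructor(check: LessonCheck) {
    this.check = check;
  }

  // Stores the lesson only if it passes the check; returns whether it was kept.
  propose(lesson: Lesson): boolean {
    if (!this.check(lesson)) return false; // unverified guesses are dropped, not hoarded
    this.lessons.push(lesson);
    return true;
  }

  // Look up stored rules by keyword instead of re-deriving them.
  consult(keyword: string): string[] {
    return this.lessons
      .filter((l) => l.rule.indexOf(keyword) !== -1)
      .map((l) => l.rule);
  }
}

// Hypothetical check: the cited failure must be one we actually reproduced.
const reproducedFailures = ["test_login_timeout"];
const memory = new VerifiedMemory((l) => reproducedFailures.indexOf(l.evidence) !== -1);

const keptVerified = memory.propose({
  rule: "login: raise timeout before retrying",
  evidence: "test_login_timeout",
});
const keptGuess = memory.propose({ rule: "login: probably a DNS issue", evidence: "hunch" });
```

The plausible-sounding DNS guess never makes it into memory, so a later run cannot mistake it for a fact.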

Where this framing is wrong, or at least incomplete

I have made the verifier sound like a clean switch: have one and you're safe, skip it and you burn. It is messier than that, and I would rather say so than sell you a slogan.

The verifier can be wrong. A test suite that passes but never covered the case that matters is a verifier that lies to you with total confidence, and a loop optimizing against it will happily satisfy the test while drifting from what you actually wanted. We have watched a version of this in our own content work: a model-judged quality score climbing run after run while the human read of the output got worse, because the loop learned to please the judge instead of the reader. The fix isn't a better slogan. It is grounding the loop in real outcomes, a real test of the real behaviour or real engagement on a real post, rather than any proxy a model can game, and keeping a divergence canary that screams when the proxy and the truth start pulling apart. A verifier buys you a closed loop. It does not buy you a verifier that is measuring the right thing. That part is still on you.
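
A divergence canary can be very simple. The sketch below fits a trend line to each series and trips when the proxy is rising while the ground truth is falling. The function names are ours, the numbers are invented for illustration rather than taken from our runs, and a production version would need thresholds that tolerate noise.

```typescript
// Least-squares slope of a series against its index (0, 1, 2, ...).
function trend(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((sum, v) => sum + v, 0) / n;
  let num = 0;
  let den = 0;
  values.forEach((v, i) => {
    num += (i - meanX) * (v - meanY);
    den += (i - meanX) * (i - meanX);
  });
  return num / den;
}

// Trips when the proxy is improving while the ground truth is getting worse.
function canaryTripped(proxy: number[], truth: number[]): boolean {
  return trend(proxy) > 0 && trend(truth) < 0;
}

// Hypothetical data: model-judged score per run vs. a human rating of the same runs.
const judgeScores = [0.70, 0.74, 0.79, 0.83];
const humanRatings = [4.1, 3.9, 3.6, 3.2];
const tripped = canaryTripped(judgeScores, humanRatings);
```

The canary only works if the truth series really is independent of the model, which is the same separation rule applied one level up.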

The best thing a closed loop can do is end

Once a loop solves the same problem reliably a few times, paying tokens to re-solve it is waste. Compile the working path into a plain function and cache it. Loops are for novel problems; production is for solved ones. At Social Neuron that is the shape of our content engine: the loop that finds what works for a brand hardens into a function that just ships it, and the analytics feed the next cycle. This is also the only moat the agentic era leaves you: as Ben Horowitz argues, defensibility has moved from code, which everyone's agents can now write, to proprietary closed-loop data, the per-brand record of what actually worked.

So you don't read this as one more wrapper-economy think piece, here is what is actually wired into ours:

npm run ci:check plus a visual gate before anything publishes. The deterministic verifier.
Circuit breakers and confirmation gates inside runAgentLoop.ts. The step budget and the human gate.
Autopilot kill switches (AUTOPILOT_AUTOPUBLISH_ENABLED), so nothing goes out unattended unless we say so.
Real engagement fed back from analytics, so the loop closes on how a post performed, not on how good the model thought it was.
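
The "compile the working path into a plain function and cache it" step from earlier in this section can be sketched as a wrapper that stops calling the loop once it has returned the same answer a few times in a row. This is an illustration, not the code in runAgentLoop.ts: harden, promoteAfter, and the caption stub are all hypothetical, and the wrapped loop is assumed to return only verified outputs.

```typescript
interface CacheEntry<O> {
  output: O;
  streak: number; // consecutive identical results from the loop
}

// `solveWithLoop` stands in for the expensive agent loop and is assumed to
// return only verified outputs. `promoteAfter` is a hypothetical threshold.
function harden<I, O>(
  solveWithLoop: (input: I) => O,
  keyOf: (input: I) => string,
  promoteAfter: number
) {
  const seen: Record<string, CacheEntry<O>> = Object.create(null);
  let loopCalls = 0;
  let cacheHits = 0;

  function run(input: I): O {
    const key = keyOf(input);
    const entry: CacheEntry<O> | undefined = seen[key];
    if (entry !== undefined && entry.streak >= promoteAfter) {
      cacheHits++; // solved problem: no tokens spent
      return entry.output;
    }
    loopCalls++; // still novel: pay for the loop
    const output = solveWithLoop(input);
    const streak =
      entry !== undefined && JSON.stringify(entry.output) === JSON.stringify(output)
        ? entry.streak + 1
        : 1;
    seen[key] = { output, streak };
    return output;
  }

  return { run, stats: () => ({ loopCalls, cacheHits }) };
}

// Stub loop: pretend this is the expensive path that finds a caption for a brand.
const shipCaption = harden((brand: string) => brand.toUpperCase() + " LAUNCH", (brand) => brand, 3);
for (let i = 0; i < 5; i++) shipCaption.run("acme");
const hardenedStats = shipCaption.stats();
```

With promoteAfter set to 3, the first three calls pay for the loop and the last two are served from the cache. A real version would also need a way to expire an entry when the analytics say it stopped working.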

So: hype or functional? Functional wherever you hand the loop a cheap, honest verifier, a budget, and a scope small enough to defend. Hype everywhere people skip those and treat the model's confidence as the same thing as being right. The technology is identical on both sides of that line. The harness is the whole difference, and the harness is the part you can build today.