What happened?
An agent entered a reflex-retry loop: the model emitted exec tool calls with empty arguments ({}) on 25 consecutive iterations. Each call was dispatched by the runner, hit the exec tool's missing 'command' parameter validation error, returned the error as a tool result, and the model retried with the same empty arguments on the next iteration. After 25 iterations the loop terminated on the max_iterations limit with the run aborted.
From the user's perspective this looked like a 4-minute hang — 25 LLM round-trips at ~10 seconds each, with no visible progress in the activity log other than the same error line repeating. The time went to Anthropic's API latency rather than MOLTIS doing work, but the user experience is indistinguishable from a stuck process.
The root cause is a reasoning failure on the model side (see "Additional context" for the model's own retrospective on what happened), but the reason the failure became a 25-iteration dead zone rather than a 2-3 iteration recoverable error is that the MOLTIS runner has no defensive layers between a malformed tool call and the tool's internal validation. This issue is about those missing defensive layers.
Four stacked issues in the runner
Issue 1: empty-args tool calls dispatched without validation
crates/agents/src/runner.rs L1630-L1644, the streaming tool-call accumulator:
StreamEvent::ToolCallStart { id, name, index } => {
    let vec_pos = tool_calls.len();
    debug!(tool = %name, id = %id, stream_index = index, vec_pos, "tool call started in stream");
    tool_calls.push(ToolCall {
        id,
        name,
        arguments: serde_json::json!({}), // ← default to empty object
    });
    stream_idx_to_vec_pos.insert(index, vec_pos);
    tool_call_args.insert(index, String::new());
},
StreamEvent::ToolCallArgumentsDelta { index, delta } => {
    if let Some(args) = tool_call_args.get_mut(&index) {
        args.push_str(&delta);
    }
},
Then the finalize loop at L1741-L1751:
for (stream_idx, args_str) in &tool_call_args {
    if let Some(&vec_pos) = stream_idx_to_vec_pos.get(stream_idx)
        && vec_pos < tool_calls.len()
        && !args_str.is_empty() // ← silent skip
        && let Ok(args) = serde_json::from_str::<serde_json::Value>(args_str)
    {
        tool_calls[vec_pos].arguments = args;
    }
}
If the model emits a content_block_start: tool_use with input: {} and no input_json_delta events follow (or the deltas arrive with a mismatched index, or the JSON fails to parse), the default json!({}) from L1636 is preserved silently. The tool call is then dispatched as-is to tool.execute(args) at L1298. Each builtin tool (e.g. exec) validates its own required fields internally and returns an error, which is wrapped as a tool result and fed back to the model.
The runner has the tool's parameters_schema() available (it sent it to the model at turn start at L1586-1588 or the corresponding non-streaming path). But it never validates the returned tool call against that same schema before dispatch. Each tool has to catch missing/invalid args individually in its own execute method, and the error goes to the LLM as a string, not to the runner's control flow.
Issue 2: no loop detection on repeated identical failures
Grep confirms: the only protection against repeated tool-call failures is the global max_iterations limit at L895:
if iterations > max_iterations {
    warn!("agent loop exceeded max iterations ({})", max_iterations);
    return Err(AgentRunError::Other(anyhow::anyhow!(
        "agent loop exceeded max iterations ({})",
        max_iterations
    )));
}
This fires at iteration 25 (the default). There is no intermediate check for "same tool + same args + same error appearing N times in a row". A model stuck in a reflex-retry loop will burn through all 25 iterations regardless of how obvious the loop is from the runner's perspective.
Issue 3: error feedback to the LLM is too terse to break the loop
When tool.execute(args) returns an error, the runner wraps it at L1332-L1351:
Err(e) => {
    let err_str = e.to_string();
    // ...
    (
        false,
        serde_json::json!({ "error": err_str }),
        Some(err_str),
    )
},
The tool result sent back to the LLM is literally {"error": "missing 'command' parameter"}. No indication of:
- What arguments the LLM actually sent (so it can see its own malformed call)
- That retrying with the same arguments will not work
- That if it doesn't know what arguments to use, it should respond in text instead of retrying
Claude-family models respond strongly to directive error text. A terse error like "missing 'command' parameter" provides too little signal for the model to break out of a flawed reasoning trajectory.
Issue 4: "Executing command..." UI status emitted before validation
The activity log shows 💻 Executing command... for each iteration, implying subprocess execution. In reality, the tool's validation fails immediately, before any process is spawned. The UI status is emitted from the runner callback at L1232-L1243 before the tool's execute method runs its own validation, so the user sees "Executing..." for calls that never actually executed anything. This is cosmetic but makes debugging much harder — the user reading the log reasonably believes commands are being dispatched and hanging, when in reality they're being rejected instantly at the validation boundary.
Evidence from the user session
User-observed activity log (reproduced exactly from their report):
⚠️ Iteration limit reached: The agent stopped after 25 iterations. You can continue if needed.
📋 Activity log
• 💻 Executing command...
• ❌ missing 'command' parameter
• 💻 Executing command...
• ❌ missing 'command' parameter
[... repeated 25 times total ...]
The run aborted on max_iterations = 25 with no intervention from iteration 3 onwards.
Why this matters
This is a known LLM failure mode. Any model under ambiguous instructions can reflex-pick a tool and emit malformed arguments, and the more constrained the model's reasoning state, the less likely it is to self-catch the pattern. The model cannot be patched to never do this — it's an emergent behavior from instruction-following training. What CAN be fixed is the runtime's ability to detect and interrupt the loop before it wastes 4+ minutes of wall time and 25 rounds of LLM inference.
Fixing this at the runner level protects every MOLTIS user against any variant of the same failure class — not just the exec({}) case we observed, but any future tool × model × prompt combination that produces a similar reflex loop.
Model self-report (corroborating evidence)
We asked the agent (Claude Haiku 4.5 via Anthropic) what happened in a follow-up conversation. The model's own retrospective, quoted verbatim:
"I got stuck in a loop calling exec with no command parameter — that's a user error on my part, not a runtime issue. [...] Instead of catching that and asking clarification, I just kept retrying the same broken pattern. This happened ~25+ times before I pivoted."
And when asked why specifically:
"I don't have a clear answer, and that's the problem. [...] I was in a mental state of 'I need to create/write/send something' but didn't actually construct the command. [...] I defaulted to trying to call a tool (any tool) to seem productive. exec without arguments fails, but I didn't treat that failure as a signal to ask you or reconsider. The retry loop kept going because the tool errors didn't trigger an escalation in my logic."
And, unprompted:
"If a tool call fails 2-3 times the same way, I should stop and tell you what's broken, not loop forever. [...] The system let me loop because I didn't escalate the problem."
The model is describing, in its own words, a reflex-retry loop it could not self-catch, and asking for exactly the loop detection that would have stopped it. This is strong ground-truth evidence that LLMs cannot always self-catch this failure mode, and that runtime-level defenses are needed.
Suggested fixes (ideas for the maintainer — please evaluate tradeoffs, not prescribed)
Four independent fixes. Any one alone would have prevented the 25-iteration loop in our case. Together they harden the runner against a whole class of future failures. These are suggestions to seed the discussion — please pick whichever combination makes sense for the project's direction.
Fix A — Loop detector with system-message intervention
Track the last N tool calls in the runner's loop state as (tool_name, args_hash) or (tool_name, error_message_hash). When M consecutive calls match the same key AND all failed, inject a system message at the top of the next iteration's messages array asking the model to stop and respond in text. Open questions for the maintainer:
- Window size (M): 2 is aggressive and may interrupt legitimate single-retry patterns. 3 is a sweet spot (allows one retry, catches clear loops). 5 wastes more iterations. Our gut says 3, but the maintainer may have a different take based on what legitimate retry patterns look like in practice.
- Identity key: exact (tool_name, args_hash) catches pure reflex loops but misses variants (exec({}) → exec({"command": ""}) → exec({"command": " "})). (tool_name, error_message) catches variants but may false-positive on legitimate same-error-different-args cases. A combined "either match fires" key is more thorough but more complex. Maintainer's call.
- System message content: the model's response to this message is load-bearing. A weak nudge will be ignored; a strong one will break the loop. We drafted a candidate body based on what Claude-family models respond to:
SYSTEM INTERVENTION — LOOP DETECTED
Your last 3 tool calls were:
1. <tool_name>(<args>) → error: <error>
2. <tool_name>(<args>) → error: <error>
3. <tool_name>(<args>) → error: <error>
These are identical failed invocations. Retrying with the same arguments
will fail again.
On your next turn:
1. Do NOT call <tool_name> or any other tool.
2. Do NOT repeat this call pattern.
3. Respond to the user in plain text.
4. Explain what you were trying to accomplish.
5. If you do not know what arguments to use, ask the user for clarification.
The user is waiting for a text response.
Principles we used: directive language over polite, concrete evidence over abstract labels, explicit escape hatch (respond in text), named forbids (do not call <tool_name>), short reinforcing final sentence. Happy to iterate if the maintainer has opinions — we don't know Anthropic's training priors as well as you likely do.
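To make the detector shape concrete, here is a minimal stdlib-only sketch of the sliding-window idea, with M = 3 and an exact (tool_name, args_hash) key. All names (LoopDetector, CallKey, record) are illustrative, not existing MOLTIS APIs, and the real implementation would live in the runner's loop state:

```rust
use std::collections::VecDeque;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const WINDOW: usize = 3; // M = 3: allows one retry, catches clear loops

#[derive(PartialEq)]
struct CallKey {
    tool_name: String,
    args_hash: u64,
    failed: bool,
}

struct LoopDetector {
    recent: VecDeque<CallKey>,
}

impl LoopDetector {
    fn new() -> Self {
        Self { recent: VecDeque::with_capacity(WINDOW) }
    }

    /// Record one finished tool call. Returns true when the last WINDOW
    /// calls used the same tool with identical arguments AND all failed —
    /// the signal to inject the intervention system message.
    fn record(&mut self, tool_name: &str, args_json: &str, failed: bool) -> bool {
        let mut h = DefaultHasher::new();
        args_json.hash(&mut h);
        let key = CallKey {
            tool_name: tool_name.to_string(),
            args_hash: h.finish(),
            failed,
        };
        if self.recent.len() == WINDOW {
            self.recent.pop_front(); // keep only the last WINDOW calls
        }
        self.recent.push_back(key);
        self.recent.len() == WINDOW
            && self.recent.iter().all(|k| k.failed && *k == self.recent[0])
    }
}
```

Swapping the key to (tool_name, error_message_hash), or checking both keys, is a one-line change in record — which is why the identity-key question above is cheap to revisit after shipping.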
Fix B — Defensive schema validation at dispatch time
Before calling tool.execute(args) at L1298, validate args against the tool's parameters_schema(). If required fields are missing or invalid, return a structured error WITHOUT calling the tool — an error that names the missing field, shows what was received, and tells the model explicitly not to retry with the same arguments:
Tool call rejected before execution: missing required field 'command'.
You sent: {}
Do not retry with the same arguments. If you do not know what command
to run, respond in text and ask the user for clarification.
This is the single most valuable behavioral fix in our opinion — Claude-family models respond very well to named error text that directs them what to do next. Every tool's execute method already does internal validation, but the error text it returns is usually terse ("missing 'command' parameter") because it's optimized for log output, not for LLM instruction. Doing the validation at the runner level lets the runner format the error for the LLM's benefit.
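A minimal sketch of the pre-dispatch check, abstracted away from serde_json for brevity: the required list would come from the schema's "required" array, and the has_field closure from the parsed arguments object. The function name and signature are illustrative, not existing MOLTIS code:

```rust
/// Hypothetical runner-level validation: reject a tool call before
/// tool.execute() runs, with an error formatted for the LLM rather than
/// for log output.
fn validate_before_dispatch(
    tool_name: &str,
    args_display: &str,          // the raw arguments, echoed back to the model
    required: &[&str],           // from the tool's parameters_schema()
    has_field: impl Fn(&str) -> bool,
) -> Result<(), String> {
    let missing: Vec<&str> = required
        .iter()
        .filter(|&&f| !has_field(f))
        .map(|&f| f)
        .collect();
    if missing.is_empty() {
        return Ok(());
    }
    // Names the field, shows what was received, forbids an identical retry,
    // and offers the escape hatch (respond in text).
    Err(format!(
        "Tool call rejected before execution: missing required field(s) {:?} for '{}'.\n\
         You sent: {}\n\
         Do not retry with the same arguments. If you do not know what values \
         to use, respond in text and ask the user for clarification.",
        missing, tool_name, args_display
    ))
}
```

In the real runner this Err would be wrapped as the tool result for that call, exactly like the existing Err(e) path, so no new message plumbing is needed.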
Fix C — Tool stripping as escalation (harder, more reliable)
If Fix A's system message is not enough to break the loop, a stronger option: on the turn AFTER the loop detector fires, pass an empty tool schema list to the LLM for one round-trip, forcing a pure text generation. The model literally cannot emit another tool call because there's no tool schema in its context for that turn. On the turn after, restore the normal tool schemas.
This is heavier-handed than a pure nudge but it's 100% reliable — it bypasses the question of whether the model obeys the nudge. The tradeoff is complexity: it requires the runner to track "intervention state" across turns. Probably worth it for the strongest variant, but may be overkill if Fix A alone works well in practice.
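The intervention state can be a small per-run state machine rather than scattered flags. A sketch, with all names (Intervention, tools_for_turn) illustrative:

```rust
/// Hypothetical per-run escalation state: strip tool schemas for exactly
/// one round-trip after the loop detector fires, then restore them.
#[derive(Debug, PartialEq)]
enum Intervention {
    None,               // normal operation
    StripToolsNextTurn, // detector fired; next request carries no tool schemas
    Restored,           // one tool-free turn completed; schemas back to normal
}

/// Decide which tool schemas to send for the upcoming LLM round-trip.
fn tools_for_turn<'a>(state: &mut Intervention, all_tools: &'a [&'a str]) -> &'a [&'a str] {
    match state {
        Intervention::StripToolsNextTurn => {
            *state = Intervention::Restored; // automatically restore next turn
            &[] // model cannot emit a tool call with no schemas in context
        }
        _ => all_tools,
    }
}
```

The detector from Fix A would set StripToolsNextTurn only if its system-message nudge already failed once, keeping this as the second rung of the escalation ladder.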
Fix D — Raw-args debug logging + reorder UI status
Two small diagnostic and cosmetic fixes:
- Debug log at the finalize loop (L1741): debug!(stream_idx, args_str = %args_str, "finalizing tool call args"). Enables diagnosing whether future variants are cases of the model emitting input: {} or the stream accumulator losing delta events. One line, no runtime cost.
- Move the ToolCallStart UI event emission to after schema validation passes. Currently emitted at L1232-L1243 before the tool's execute method runs its internal validation, which shows Executing command... for calls that never spawn a subprocess. Moving it prevents debugging confusion and better represents what's actually happening.
Priority suggestions
Our rough take on priority, though the maintainer may weigh differently:
- Fix A (loop detector + system message) — highest leverage, bounds the blast radius of any future malformed-call bug regardless of root cause. This is the one we'd recommend shipping first even before the others if only one lands.
- Fix B (defensive schema validation) — second highest, turns "silent broken dispatch" into "structured error the LLM can act on", fixes the specific reflex-retry class.
- Fix D (debug log) — small, enables diagnosis for future variants, low risk.
- Fix C (tool stripping escalation) — more complex, only needed if Fix A alone isn't reliable in practice.
- Fix D (UI reorder) — cosmetic but improves debuggability.
Is this a regression?
I don't know
Moltis version
20260410.01
Install method
Built from source
Component
Agent / LLM providers
Operating system
Other Linux (Alpine-based Fly.io container)
Additional context
Part of the same review pass as #631, #632, #633, #638 (closed), #639 (closed), #640, #641, #654, #655, #656, #657. The review is surfacing consistent patterns where documented/advertised behavior doesn't fire in production and where the runtime lacks defensive checks against known LLM failure modes. This particular issue is the highest-impact one we've found in terms of user-facing damage — a 4-minute dead zone with no visible progress is the worst kind of silent failure because the user has no signal to intervene.
I cannot submit a PR right now (not a Rust developer day-to-day, heading into vacation). Filing with full evidence and design-level suggestions so whoever picks this up has a concrete starting point. The four fixes are independent and can land separately if that's easier.