fix: prevent tool calls from leaking into text and detect fake tool call output#14
fix: prevent tool calls from leaking into text and detect fake tool call output#14alzhang-git wants to merge 4 commits intomicrosoft:mainfrom
Conversation
…all output Fix 1 - converters.py (root cause prevention): - _extract_content(): Skip tool_call/tool_use blocks so they don't leak into text - convert_messages_to_prompt(): Extract tool calls from content blocks when the tool_calls key is missing, so conversation history is correctly serialized Fix 2 - provider.py (defensive detection): - When the LLM produces text containing [Tool Call: ...] but no actual structured tool calls, the provider detects this, adds a correction message, and retries (up to 2 times) to prevent fake tool results from reaching the user Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
alzhang-git
left a comment
There was a problem hiding this comment.
This is to address issue time by time tool calling is skipped. E.g. when writing to a file, tool calling is not executed and instead of print file as text
Root cause: [Tool Call: name({args})] format in prompt history was being
mimicked by the LLM as plain text (36% of assistant messages had fake
tool calls). Changed to XML tags that are less likely to be reproduced:
- Tool calls: <tool_used name="name">args</tool_used>
- Tool results: <tool_result name="name">content</tool_result>
Additional improvements:
- _extract_content(): skip tool_result and thinking blocks too
- Detection regex: catch Tool Result(), <tool_used> patterns
- Stronger correction message on fake tool retry
- Updated all test assertions for new format
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅ PR Review: Stress Test Validation CompleteRan comprehensive stress tests — all passed:
Models: claude-sonnet-4, claude-opus-4.5, gpt-4.1, gpt-5-mini, o3-mini, gemini-3-pro-preview 🔧 One Fix RequestedMessage list mutation at The retry loop does Suggested fix: Work on a copy with 💡 Other Suggestions (Non-blocking)
I'll plan to merge in 48 hours. Let me know if you'd like to address the mutation issue first — happy to wait for a quick fix. Otherwise LGTM overall! 🚀 |
There was a problem hiding this comment.
Pull request overview
This PR fixes prompt/serialization issues that caused tool calls/results to be emitted as plain text instead of remaining structured, and adds a defensive retry when the model “fakes” tool calls in text without producing structured tool calls.
Changes:
- Update prompt serialization to represent tool usage/history via explicit XML-like tags and to prevent tool blocks from leaking into text content.
- Recover missing assistant
tool_callsby extracting them from OpenAI-stylecontentblocks (tool_call/tool_use) when needed. - Add provider-side detection + retry when responses include tool-call-like text but no structured tool calls.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
amplifier_module_provider_github_copilot/converters.py |
Skips tool/thinking blocks in _extract_content, extracts tool calls from content blocks, and serializes tool usage/results with <tool_used> / <tool_result> tags. |
amplifier_module_provider_github_copilot/provider.py |
Adds regex-based detection of “fake tool calls” in text and retries with a correction message. |
tests/test_converters.py |
Updates assertions to match the new <tool_used> / <tool_result> serialization format. |
tests/integration/test_multi_model_saturation.py |
Updates tool-call-pattern counting to match the new <tool_used ...> format in serialized prompts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…d test - Use spread operator for retry messages to avoid mutating caller's list - Add <tool_result> to fake tool detection regex - Add unit test for tool call extraction from content blocks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Thanks for the thorough review and stress testing @mowree! All three items addressed in 2ab91ff:
Test results: 609 passed, 2 pre-existing failures in |
Summary
Fixes two bugs in tool calling where tool calls are skipped and instead printed as text.
Fix 1 — converters.py (Root cause prevention)
Fix 2 — provider.py (Defensive detection)
Testing