fix: prevent tool calls from leaking into text and detect fake tool call output by alzhang-git · Pull Request #14 · microsoft/amplifier-module-provider-github-copilot

alzhang-git · 2026-02-28T21:55:46Z

Summary

Fixes two bugs in tool calling where tool calls are skipped and instead printed as text.

Fix 1 — converters.py (Root cause prevention)

_extract_content(): Now skips ool_call/ ool_use blocks so they don't leak into serialized text
convert_messages_to_prompt(): Extracts tool calls from content blocks when the ool_calls key is missing, so conversation history is correctly serialized

Fix 2 — provider.py (Defensive detection)

When the LLM produces text containing [Tool Call: ...] but no actual structured tool calls, the provider detects this, adds a correction message ("use actual tools, don't write tool calls as text"), and retries (up to 2 times)
This prevents the user from seeing fake tool results that never actually executed

Testing

All 691 existing tests pass (2 pre-existing failures in est_coverage_gaps.py unrelated to this change)
Both modified files pass syntax validation

…all output Fix 1 - converters.py (root cause prevention): - _extract_content(): Skip tool_call/tool_use blocks so they don't leak into text - convert_messages_to_prompt(): Extract tool calls from content blocks when the tool_calls key is missing, so conversation history is correctly serialized Fix 2 - provider.py (defensive detection): - When the LLM produces text containing [Tool Call: ...] but no actual structured tool calls, the provider detects this, adds a correction message, and retries (up to 2 times) to prevent fake tool results from reaching the user Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alzhang-git

This is to address issue time by time tool calling is skipped. E.g. when writing to a file, tool calling is not executed and instead of print file as text

Root cause: [Tool Call: name({args})] format in prompt history was being mimicked by the LLM as plain text (36% of assistant messages had fake tool calls). Changed to XML tags that are less likely to be reproduced: - Tool calls: <tool_used name="name">args</tool_used> - Tool results: <tool_result name="name">content</tool_result> Additional improvements: - _extract_content(): skip tool_result and thinking blocks too - Detection regex: catch Tool Result(), <tool_used> patterns - Stronger correction message on fake tool retry - Updated all test assertions for new format Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mowree · 2026-03-01T01:32:45Z

✅ PR Review: Stress Test Validation Complete

Ran comprehensive stress tests — all passed:

Test Suite	Tests	Result
Cross-Model Structured Calls	6 models	✅ Passed
Saturated History Mimicry (5/15/25 turns)	18 tests	✅ Passed
Token Economics	3 tests	✅ Passed
Unit Tests	29	✅ Passed

Models: claude-sonnet-4, claude-opus-4.5, gpt-4.1, gpt-5-mini, o3-mini, gemini-3-pro-preview

🔧 One Fix Requested

Message list mutation at provider.py:1057:

The retry loop does messages.append(correction_msg) which mutates the caller's list in-place. After complete() returns, the caller's message history contains correction messages they didn't add. This can cause unexpected behavior in multi-turn conversations.

Suggested fix: Work on a copy with messages = list(messages) before appending.

💡 Other Suggestions (Non-blocking)

Consider moving regex to module level (style preference)
Consider adding <tool_result> to detection regex
Note: the fake-call regex matches <tool_used>, which the converter now emits to prompt history — potential for false positives if LLM echoes prompt content (our saturation tests didn't trigger this, but worth monitoring)

I'll plan to merge in 48 hours. Let me know if you'd like to address the mutation issue first — happy to wait for a quick fix. Otherwise LGTM overall! 🚀

Copilot

Pull request overview

This PR fixes prompt/serialization issues that caused tool calls/results to be emitted as plain text instead of remaining structured, and adds a defensive retry when the model “fakes” tool calls in text without producing structured tool calls.

Changes:

Update prompt serialization to represent tool usage/history via explicit XML-like tags and to prevent tool blocks from leaking into text content.
Recover missing assistant tool_calls by extracting them from OpenAI-style content blocks (tool_call / tool_use) when needed.
Add provider-side detection + retry when responses include tool-call-like text but no structured tool calls.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`amplifier_module_provider_github_copilot/converters.py`	Skips tool/thinking blocks in `_extract_content`, extracts tool calls from content blocks, and serializes tool usage/results with `<tool_used>` / `<tool_result>` tags.
`amplifier_module_provider_github_copilot/provider.py`	Adds regex-based detection of “fake tool calls” in text and retries with a correction message.
`tests/test_converters.py`	Updates assertions to match the new `<tool_used>` / `<tool_result>` serialization format.
`tests/integration/test_multi_model_saturation.py`	Updates tool-call-pattern counting to match the new `<tool_used ...>` format in serialized prompts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

amplifier_module_provider_github_copilot/provider.py

amplifier_module_provider_github_copilot/converters.py

…d test - Use spread operator for retry messages to avoid mutating caller's list - Add <tool_result> to fake tool detection regex - Add unit test for tool call extraction from content blocks Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

alzhang-git · 2026-03-01T22:54:33Z

Thanks for the thorough review and stress testing @mowree!

All three items addressed in 2ab91ff:

Message mutation — Fixed. Using retry_messages = [*messages, correction_msg] so the caller's list is never mutated.
<tool_result> detection — Added to the regex.
Missing unit test — Added test_assistant_tool_calls_extracted_from_content_blocks.

Test results: 609 passed, 2 pre-existing failures in test_coverage_gaps.py (timing-related, unrelated to this PR).

alzhang-git commented Feb 28, 2026

View reviewed changes

alzhang-git and others added 2 commits February 28, 2026 14:14

chore: remove test artifact

c553df6

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mowree requested a review from Copilot March 1, 2026 01:32

Copilot started reviewing on behalf of mowree March 1, 2026 01:33 View session

Copilot AI reviewed Mar 1, 2026

View reviewed changes

amplifier_module_provider_github_copilot/provider.py Outdated Show resolved Hide resolved

amplifier_module_provider_github_copilot/provider.py Show resolved Hide resolved

amplifier_module_provider_github_copilot/converters.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent tool calls from leaking into text and detect fake tool call output#14

fix: prevent tool calls from leaking into text and detect fake tool call output#14
alzhang-git wants to merge 4 commits intomicrosoft:mainfrom
alzhang-git:alzhang-bugfix

alzhang-git commented Feb 28, 2026

Uh oh!

alzhang-git left a comment

Uh oh!

mowree commented Mar 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alzhang-git commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

alzhang-git commented Feb 28, 2026

Summary

Fix 1 — converters.py (Root cause prevention)

Fix 2 — provider.py (Defensive detection)

Testing

Uh oh!

alzhang-git left a comment

Choose a reason for hiding this comment

Uh oh!

mowree commented Mar 1, 2026

✅ PR Review: Stress Test Validation Complete

🔧 One Fix Requested

💡 Other Suggestions (Non-blocking)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alzhang-git commented Mar 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants