# Issue #262 Investigation Notes

## Problem Statement
`session.call_tool()` hangs indefinitely while `session.list_tools()` works fine.
The server executes successfully and produces results, but the client cannot receive them.
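
A minimal sketch of the flow in question, assuming the standard Python SDK client API (the server command, tool name, and arguments here are placeholders):

```python
import anyio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Placeholder server command; any stdio MCP server that exposes a tool will do.
    params = StdioServerParameters(command="python", args=["server.py"])
    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            print(await session.list_tools())           # completes normally
            print(await session.call_tool("echo", {}))  # hangs on affected setups

anyio.run(main)
```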

## Key Observations from Issue
- Stepping through with a debugger makes the issue disappear (timing/race condition)
- Works on native Windows, fails on WSL Ubuntu
- Affects both stdio and SSE transports
- The server produces output, but the client doesn't receive it

## Related Issues

### Issue #1764 - CRITICAL INSIGHT!
**Problem:** Race condition in StreamableHTTPServerTransport causing SSE connections to hang.

**Root Cause:** Zero-buffer memory streams + the `tg.start_soon()` pattern cause a deadlock:
- `send()` blocks until `receive()` is called on zero-buffer streams
- If the sender runs before the receiver task has initialized, a deadlock occurs
- Responses with 1-2 items work, 3+ items deadlock (timing dependent!)

**Fix:** Either increase the buffer size OR use `await tg.start()` to ensure the receiver is ready.

**Relevance to #262:** The `stdio_client` uses EXACTLY this pattern:
```python
read_stream_writer, read_stream = anyio.create_memory_object_stream(0)  # Zero buffer!
write_stream, write_stream_reader = anyio.create_memory_object_stream(0)  # Zero buffer!
# ...
tg.start_soon(stdout_reader)  # Not awaited!
tg.start_soon(stdin_writer)  # Not awaited!
```

This could cause the exact hang described in #262 if the server responds before
the client's receive loop is ready.
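
To make the mechanism concrete, here is a minimal, self-contained illustration (plain anyio, not SDK code) of why a zero-buffer stream is sensitive to receiver readiness: `send()` cannot complete until some task is actively waiting in `receive()`.

```python
import anyio

async def main() -> None:
    # With max_buffer_size=0 there is nowhere to park the item, so send()
    # only finishes once another task is blocked in receive().
    send_stream, receive_stream = anyio.create_memory_object_stream(0)

    with anyio.move_on_after(0.5) as scope:
        await send_stream.send("response")  # no receiver yet -> never completes

    print("send() was still blocked after 0.5s:", scope.cancelled_caught)  # True

    await send_stream.aclose()
    await receive_stream.aclose()

anyio.run(main)
```

Whether that wait turns into a permanent hang then depends on whether the reader task started with `start_soon()` ever reaches its receive loop.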

## Comprehensive Test Results

### Test Categories and Results

| Category | Tests | Result | Notes |
|----------|-------|--------|-------|
| Basic tool call | 1 | PASS | Simple scenario works |
| Buffering tests | 3 | PASS | Flush/no-flush, unbuffered all work |
| 0-capacity streams | 3 | PASS | Rapid responses, notifications work |
| Interleaved notifications | 2 | PASS | Server notifications during tool work |
| Sampling during tool | 1 | PASS | Bidirectional communication works |
| Timing races | 2 | PASS | Small delays don't trigger |
| Big delays (2-3 sec) | 1 | PASS | Server delays don't cause hang |
| Instant response | 1 | PASS | Immediate response works |
| Burst responses | 1 | PASS | 20 rapid log messages handled |
| Slow callbacks | 2 | PASS | Slow logging/message handlers work |
| Many iterations | 1 | PASS | 50 rapid iterations all succeed |
| Concurrent sessions | 2 | PASS | Multiple parallel sessions work |
| Stress tests | 2 | PASS | 30 sequential sessions work |
| Patched SDK | 3 | PASS | Delays in SDK don't trigger |
| CPU pressure | 1 | PASS | Heavy CPU load doesn't trigger |
| Raw subprocess | 2 | PASS | Direct pipe communication works |
| Preemptive response | 1 | PASS | Unbuffered immediate response works |

**Total: 34 tests, all passing**

### Theories Tested

1. **Stdout Buffering** - Server not flushing stdout after responses
   - Result: NOT the cause - works with and without flush

2. **0-Capacity Streams** - stdio_client uses unbuffered streams (capacity 0)
   - Result: NOT the cause on this platform - works in the test environment

3. **Interleaved Notifications** - Server sending log notifications during tool execution
   - Result: NOT the cause - notifications handled correctly

4. **Bidirectional Communication** - Server requesting sampling during tool execution
   - Result: NOT the cause - bidirectional communication works

5. **Timing/Race Conditions** - Small delays in server response
   - Result: Could not reproduce with various delay patterns

6. **Big Delays (2-3 seconds)** - As suggested in the issue comments
   - Result: NOT the cause - big delays work fine

7. **Slow Callbacks** - Message handler/logging callback that blocks
   - Result: NOT the cause - slow callbacks work

8. **Zero-buffer + start_soon race** (from #1764)
   - Result: Could not reproduce (see the sketch after this list), but this remains the most likely cause

9. **CPU Pressure** - Heavy CPU load exposing timing issues
   - Result: NOT the cause on this platform

10. **Raw Subprocess Communication** - Direct pipe handling
    - Result: Works correctly, issue is not in pipe handling
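
The sketch below is roughly what the theory-8 simulation amounts to (plain anyio, illustrative names, not SDK code). It also shows why the race is hard to trigger in-process: the blocked `send()` simply resumes once the late receiver finally calls `receive()`, so something environment-specific would have to keep the reader task from ever getting there.

```python
import anyio

async def main() -> None:
    send_stream, receive_stream = anyio.create_memory_object_stream(0)

    async def slow_receiver() -> None:
        await anyio.sleep(1)  # simulate a receive loop that starts late
        async with receive_stream:
            print("receiver got:", await receive_stream.receive())

    async with anyio.create_task_group() as tg:
        tg.start_soon(slow_receiver)        # not awaited, like stdio_client
        await send_stream.send("response")  # waits ~1s, then completes
        print("send() completed once the receiver was ready")
        await send_stream.aclose()

anyio.run(main)
```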

## Environment Notes
- Testing on: Linux (not WSL)
- Python: 3.11.14
- Using anyio for async
- All 34 tests pass consistently

## Conclusions

### Why We Cannot Reproduce
The issue appears to be **highly environment-specific**:
1. **WSL-specific behavior** - The original reporter experienced this on WSL Ubuntu, not native Linux/Windows
2. **Timing-dependent** - Debugger stepping makes it disappear, suggesting a very narrow timing window
3. **Platform-specific pipe behavior** - WSL has different I/O characteristics than native Linux

### Most Likely Root Cause
Based on issue #1764, the most likely cause is the **zero-buffer memory stream + start_soon pattern**:
1. `stdio_client` creates 0-capacity streams
2. Reader/writer tasks are started with `start_soon` (not awaited)
3. In certain environments (WSL), the timing allows responses to arrive before the receive loop is ready
4. This causes the send to block indefinitely (deadlock)

### Potential Fixes (to be verified on WSL)
1. **Increase stream buffer size** - Change from `anyio.create_memory_object_stream(0)` to `anyio.create_memory_object_stream(1)` or higher
2. **Use `await tg.start()`** - Ensure the receive loop is ready before returning from the context manager
3. **Add synchronization** - Use an Event to signal when the receive loop is ready
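
A minimal sketch of what fixes 2 and 3 amount to in anyio terms (illustrative names, not the SDK's actual code): the parent waits for the receive loop to report readiness before anything is sent.

```python
import anyio

async def receive_loop(receive_stream, *, task_status=anyio.TASK_STATUS_IGNORED):
    # Report readiness before the parent continues, so the first send()
    # cannot race ahead of a receive loop that was never scheduled.
    task_status.started()
    async with receive_stream:
        async for message in receive_stream:
            print("got:", message)

async def main() -> None:
    send_stream, receive_stream = anyio.create_memory_object_stream(0)
    async with anyio.create_task_group() as tg:
        await tg.start(receive_loop, receive_stream)  # blocks until started()
        async with send_stream:
            await send_stream.send("response")

anyio.run(main)
```

Fix 1 addresses the same window from the other side: with a buffer of 1 or more, the first `send()` no longer needs an active receiver in order to complete.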

## Files Created
- `tests/issues/test_262_tool_call_hang.py` - Comprehensive test suite (34 tests)
- `tests/issues/reproduce_262_standalone.py` - Standalone reproduction script
- `ISSUE_262_INVESTIGATION.md` - This investigation document

## Recommendations
1. **For users experiencing this issue:**
   - Try running on native Linux or Windows instead of WSL
   - Check whether adding a small delay after session creation helps

2. **For maintainers:**
   - Consider changing the stream buffer size in `stdio_client` from 0 to 1
   - Consider using the `await tg.start()` pattern instead of `start_soon` for critical tasks
   - Test changes specifically on WSL Ubuntu to verify the fix

3. **For further investigation:**
   - A WSL Ubuntu environment is needed to reproduce
   - Could try patching `stdio_client` to use `anyio.create_memory_object_stream(1)` and test (a sketch follows below)
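
For anyone with a WSL setup, the buffer-size experiment is a hypothetical two-line change inside `stdio_client` (stream names as in the snippet quoted earlier in these notes):

```python
# Hypothetical experiment: give each in-memory stream a 1-item buffer so the
# first send() can complete even before the matching reader task is running.
read_stream_writer, read_stream = anyio.create_memory_object_stream(1)
write_stream, write_stream_reader = anyio.create_memory_object_stream(1)
```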