fix: HTTP server restart recovery with graceful client reconnection by zenyr · Pull Request #72 · zenyr/mcp-pty

zenyr · 2025-10-22T12:47:08Z

Summary

Issue: MCP client panics when HTTP server restarts (same port) because client doesn't reconnect gracefully.

Scenario:

MCP HTTP server running
MCP client connects, acquires session
Server restarts (port unchanged)
Client unaware of restart, calls tools/list
Server returns 404 (stale sessionId)
Expected: Client reconnects gracefully, resumes operation
Actual: Client panics (SDK error)

Solution: MCP server cooperatively responds to stale sessions + client handles 404 recovery

Changes

Port Conflict Handling (50a9c64)
- Catch EADDRINUSE on server startup
- Display helpful error with port and resolution (--port flag or MCP_PTY_PORT env var)
Session Init Race Condition (9070546)
- Explicit wait loop (100ms) when concurrent requests initialize session
- Prevents 'Server not initialized' 500 errors
- Timeout reduced from 5s → 100ms (microseconds in practice)
404 Recovery (Stale Session) (2109330)
- Remove deferred init on 404 to prevent race condition
- Server immediately ready when issuing new sessionId
- Client receives 404 + new sessionId header, SDK auto-updates

Tests

✅ 8/8 HTTP transport tests passing
✅ E2E recovery flow validates stale session → 404 → new session → operational
⚠️ E2E test has incorrect assumptions (see comments)

- Wrap Bun.serve with try-catch for EADDRINUSE errors - Display clear error message with port and resolution steps - Guide users to use --port flag or MCP_PTY_PORT env var - Add await to async calls in index.ts for proper error propagation

- Implement explicit wait loop when another request initializes session - Add final session status verification before handleRequest - Add comprehensive client outlive server test scenarios - All tests pass with zero 500 'Server not initialized' errors

Concurrent session initialization is extremely rare (must happen within same millisecond). Real-world scenario: server.connect() completes in microseconds. Reduced timeout: 50*100ms to 10*10ms = 100ms maximum wait. All tests still pass.

…bleHTTPClientTransport - Test concurrent client requests immediately after session initialization (race condition) - Test multiple isolated MCP clients on same server instance - Test real MCP client with stale session recovery (opencode scenario) - All tests use @modelcontextprotocol/sdk StreamableHTTPClientTransport - 9/9 tests pass with zero 500 errors Implemented by: Haiku 4.5

…recovery - Remove deferred initialization for 404 responses to prevent race condition - When client receives 404 + new sessionId, server is now ready immediately - Prevents 400 'Server not initialized' errors on client retry - Client recovery after stale session now works without errors Implemented by: Haiku 4.5

Document findings from HTTP session recovery test: - 404 recovery mechanism works at HTTP level - Deferred initialization implemented - Issue identified: transport.handleRequest() returns 400 after server restart - Root cause: StreamableHTTPServerTransport may bind to HTTP request lifetime - Created 7 test files for manual E2E validation - All 9 unit tests pass, E2E RPC calls fail with 'Server not initialized' Next: Investigate transport factory pattern or SDK design

…lifecycle - Return Server instance from startHttpServer for test lifecycle control - Refactor E2E test to manage server inline without subprocess spawning - Use Bun.serve().stop() for graceful server shutdown between test phases - Test validates server restart recovery: kill server, reconnect with new session, resume operation - Passes in ~1 second with clean async/await pattern - Remove obsolete spawn-based server helper file

- Document problem: previous E2E test was too heavy with subprocess spawning - Explain solution: return Server instance from startHttpServer, manage lifecycle inline - Show test pattern with 4 phases (normal > down > recovery with new session > functional) - Validate complete recovery flow: stale sessionId > 404 > auto-update > success - Performance: ~1 second total, no process management overhead - Lessons learned on Bun.serve() lifecycle and subprocess avoidance

Delete redundant test files lacking proper Bun.serve lifecycle encapsulation. Keep only http-session-recovery-e2e.test.ts which properly manages server lifecycle.

zenyr · 2025-10-23T01:44:54Z

E2E Test Issue: Incorrect Assumptions

Problem: Current E2E test doesn't validate actual recovery scenario. Test maintains single client instance throughout, but recovery requires graceful reconnection.

Current Test Behavior:

// Phase 2-3: Kill + restart server
await killSrv();        // Manually deletes sessionManager sessions
await startSrv(6426);   // Restart server

// Phase 4: Client calls with stale sessionId
await call("2");        // Logs 404 but doesn't validate graceful reconnection
state.s2 = tr.sessionId;  // Just records new sessionId after 404

Why This is Wrong:

Manual sessionManager.deleteSession() is fake server restart (not actual)
Client instance never breaks connection/reconnects — it just keeps existing connection
404 recovery assumed, but not validated that client remains functional
Real scenario: client should survive server restart, not just absorb new sessionId

Expected Behavior (per scenario):

Server restarts → old sessionId invalid → 404
Client receives 404 + new sessionId header
SDK auto-updates transport.sessionId
Client continues to work with new session (no panic, no errors)

Test Should Validate:

Client instance persistent throughout test (no teardown/reconnect)
Before restart: normal operation ✅
After restart: tools/list etc. returns 404 on first call ✅
SDK gracefully updates sessionId (SDK responsibility, test assumption ok)
After recovery: client calls succeed again ✅
Critical: Zero "Server not initialized" or panic errors from client

Correct Test Pattern:

Keep client instance alive across server kill/restart
Validate first call post-restart yields 404 (stale session)
Validate subsequent calls succeed (client auto-updated sessionId)
Timeout/error handling for client retries is SDK responsibility

Action: Next commit should fix test to validate above pattern + add assertions for graceful reconnection.

zenyr added 4 commits October 22, 2025 21:46

feat: add graceful HTTP port conflict handling

50a9c64

- Wrap Bun.serve with try-catch for EADDRINUSE errors - Display clear error message with port and resolution steps - Guide users to use --port flag or MCP_PTY_PORT env var - Add await to async calls in index.ts for proper error propagation

zenyr force-pushed the feature/port-conflict-handling branch from 62936a0 to 86697a6 Compare October 22, 2025 13:03

zenyr added 3 commits October 22, 2025 22:06

zenyr force-pushed the feature/port-conflict-handling branch from e1a1c74 to d617252 Compare October 22, 2025 13:48

zenyr added 3 commits October 22, 2025 22:48

refactor: remove duplicate e2e test scripts from PR #72

8ec2c50

Delete redundant test files lacking proper Bun.serve lifecycle encapsulation. Keep only http-session-recovery-e2e.test.ts which properly manages server lifecycle.

docs: update agentlog for e2e test cleanup

580e744

zenyr marked this pull request as draft October 22, 2025 14:10

zenyr changed the title ~~feat: graceful HTTP port conflict handling~~ fix: HTTP server restart recovery with graceful client reconnection Oct 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: HTTP server restart recovery with graceful client reconnection#72

fix: HTTP server restart recovery with graceful client reconnection#72
zenyr wants to merge 10 commits intodevelopfrom
feature/port-conflict-handling

zenyr commented Oct 22, 2025 •

edited

Loading

Uh oh!

zenyr commented Oct 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zenyr commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Tests

Uh oh!

zenyr commented Oct 23, 2025

E2E Test Issue: Incorrect Assumptions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zenyr commented Oct 22, 2025 •

edited

Loading