Skip to content

fix: HTTP server restart recovery with graceful client reconnection#72

Draft
zenyr wants to merge 10 commits intodevelopfrom
feature/port-conflict-handling
Draft

fix: HTTP server restart recovery with graceful client reconnection#72
zenyr wants to merge 10 commits intodevelopfrom
feature/port-conflict-handling

Conversation

@zenyr
Copy link
Owner

@zenyr zenyr commented Oct 22, 2025

Summary

Issue: MCP client panics when HTTP server restarts (same port) because client doesn't reconnect gracefully.

Scenario:

  1. MCP HTTP server running
  2. MCP client connects, acquires session
  3. Server restarts (port unchanged)
  4. Client unaware of restart, calls tools/list
  5. Server returns 404 (stale sessionId)
  6. Expected: Client reconnects gracefully, resumes operation
  7. Actual: Client panics (SDK error)

Solution: MCP server cooperatively responds to stale sessions + client handles 404 recovery

Changes

  1. Port Conflict Handling (50a9c64)

    • Catch EADDRINUSE on server startup
    • Display helpful error with port and resolution (--port flag or MCP_PTY_PORT env var)
  2. Session Init Race Condition (9070546)

    • Explicit wait loop (100ms) when concurrent requests initialize session
    • Prevents 'Server not initialized' 500 errors
    • Timeout reduced from 5s → 100ms (microseconds in practice)
  3. 404 Recovery (Stale Session) (2109330)

    • Remove deferred init on 404 to prevent race condition
    • Server immediately ready when issuing new sessionId
    • Client receives 404 + new sessionId header, SDK auto-updates

Tests

  • ✅ 8/8 HTTP transport tests passing
  • ✅ E2E recovery flow validates stale session → 404 → new session → operational
  • ⚠️ E2E test has incorrect assumptions (see comments)

zenyr added 4 commits October 22, 2025 21:46
- Wrap Bun.serve with try-catch for EADDRINUSE errors
- Display clear error message with port and resolution steps
- Guide users to use --port flag or MCP_PTY_PORT env var
- Add await to async calls in index.ts for proper error propagation
- Implement explicit wait loop when another request initializes session
- Add final session status verification before handleRequest
- Add comprehensive client outlive server test scenarios
- All tests pass with zero 500 'Server not initialized' errors
Concurrent session initialization is extremely rare (must happen within same millisecond).
Real-world scenario: server.connect() completes in microseconds.
Reduced timeout: 50*100ms to 10*10ms = 100ms maximum wait.
All tests still pass.
…bleHTTPClientTransport

- Test concurrent client requests immediately after session initialization (race condition)
- Test multiple isolated MCP clients on same server instance
- Test real MCP client with stale session recovery (opencode scenario)
- All tests use @modelcontextprotocol/sdk StreamableHTTPClientTransport
- 9/9 tests pass with zero 500 errors

Implemented by: Haiku 4.5
@zenyr zenyr force-pushed the feature/port-conflict-handling branch from 62936a0 to 86697a6 Compare October 22, 2025 13:03
zenyr added 3 commits October 22, 2025 22:06
…recovery

- Remove deferred initialization for 404 responses to prevent race condition
- When client receives 404 + new sessionId, server is now ready immediately
- Prevents 400 'Server not initialized' errors on client retry
- Client recovery after stale session now works without errors

Implemented by: Haiku 4.5
Document findings from HTTP session recovery test:
- 404 recovery mechanism works at HTTP level
- Deferred initialization implemented
- Issue identified: transport.handleRequest() returns 400 after server restart
- Root cause: StreamableHTTPServerTransport may bind to HTTP request lifetime
- Created 7 test files for manual E2E validation
- All 9 unit tests pass, E2E RPC calls fail with 'Server not initialized'

Next: Investigate transport factory pattern or SDK design
…lifecycle

- Return Server instance from startHttpServer for test lifecycle control
- Refactor E2E test to manage server inline without subprocess spawning
- Use Bun.serve().stop() for graceful server shutdown between test phases
- Test validates server restart recovery: kill server, reconnect with
  new session, resume operation
- Passes in ~1 second with clean async/await pattern
- Remove obsolete spawn-based server helper file
@zenyr zenyr force-pushed the feature/port-conflict-handling branch from e1a1c74 to d617252 Compare October 22, 2025 13:48
zenyr added 3 commits October 22, 2025 22:48
- Document problem: previous E2E test was too heavy with subprocess spawning
- Explain solution: return Server instance from startHttpServer, manage lifecycle inline
- Show test pattern with 4 phases (normal > down > recovery with new session > functional)
- Validate complete recovery flow: stale sessionId > 404 > auto-update > success
- Performance: ~1 second total, no process management overhead
- Lessons learned on Bun.serve() lifecycle and subprocess avoidance
Delete redundant test files lacking proper Bun.serve lifecycle
encapsulation. Keep only http-session-recovery-e2e.test.ts which
properly manages server lifecycle.
@zenyr zenyr marked this pull request as draft October 22, 2025 14:10
@zenyr zenyr changed the title feat: graceful HTTP port conflict handling fix: HTTP server restart recovery with graceful client reconnection Oct 23, 2025
@zenyr
Copy link
Owner Author

zenyr commented Oct 23, 2025

E2E Test Issue: Incorrect Assumptions

Problem: Current E2E test doesn't validate actual recovery scenario. Test maintains single client instance throughout, but recovery requires graceful reconnection.

Current Test Behavior:

// Phase 2-3: Kill + restart server
await killSrv();        // Manually deletes sessionManager sessions
await startSrv(6426);   // Restart server

// Phase 4: Client calls with stale sessionId
await call("2");        // Logs 404 but doesn't validate graceful reconnection
state.s2 = tr.sessionId;  // Just records new sessionId after 404

Why This is Wrong:

  1. Manual sessionManager.deleteSession() is fake server restart (not actual)
  2. Client instance never breaks connection/reconnects — it just keeps existing connection
  3. 404 recovery assumed, but not validated that client remains functional
  4. Real scenario: client should survive server restart, not just absorb new sessionId

Expected Behavior (per scenario):

  • Server restarts → old sessionId invalid → 404
  • Client receives 404 + new sessionId header
  • SDK auto-updates transport.sessionId
  • Client continues to work with new session (no panic, no errors)

Test Should Validate:

  1. Client instance persistent throughout test (no teardown/reconnect)
  2. Before restart: normal operation ✅
  3. After restart: tools/list etc. returns 404 on first call ✅
  4. SDK gracefully updates sessionId (SDK responsibility, test assumption ok)
  5. After recovery: client calls succeed again ✅
  6. Critical: Zero "Server not initialized" or panic errors from client

Correct Test Pattern:

  • Keep client instance alive across server kill/restart
  • Validate first call post-restart yields 404 (stale session)
  • Validate subsequent calls succeed (client auto-updated sessionId)
  • Timeout/error handling for client retries is SDK responsibility

Action: Next commit should fix test to validate above pattern + add assertions for graceful reconnection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant