GatewayManager port auto-increment spiral -- writeProfilePort silently corrupts config.yaml on every restart #607

@andy0532

Summary

The built-in GatewayManager in dist/server/services/hermes/gateway-manager.js has a writeProfilePort() method that modifies the user's Hermes Agent config.yaml on every WebUI restart, causing a port auto-increment spiral that eventually leads to a complete port clash between the gateway and the WebUI's own Node server.

Root Cause

The resolvePort() method is called during bootstrap via startAll() → resolvePort(). It always calls writeProfilePort(), even when the configured port is perfectly valid and available. This is visible at line ~260:

// Port is free — but still writes config
this.writeProfilePort(name, port, host);

writeProfilePort() does:

  1. yaml.load() the config
  2. Destroys the extra structure
  3. Dumps the entire YAML back via yaml.dump() — losing all comments, ordering, and any fields the GatewayManager doesn't know about

Meanwhile, detectStatus() has a fragile three-way verification (PID file + process alive + health check on configured port). If the Gateway was restarted for any reason and the health check fails briefly, detectStatus() returns running: false → resolvePort() thinks the port is in use → increments by 1 → writes new config → calls hermes gateway restart on the new port.
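
One way to make this less fragile is to separate "the port is held by our own gateway" from "the port is held by something else" before deciding to reassign. The sketch below is only an illustration: classifyGateway and its input fields are hypothetical names, not taken from gateway-manager.js. Only the three signals (PID file, process alive, health check) come from the description above.

```javascript
// Hypothetical sketch, not the actual GatewayManager code.
// Inputs mirror the three detectStatus() signals plus a "port is bound" probe.
function classifyGateway({ pidFileExists, processAlive, healthOk, portBusy }) {
  const ownGatewayAlive = pidFileExists && processAlive;
  if (ownGatewayAlive) {
    // Our gateway process exists: a failed health check most likely means it
    // is still starting, so retry later instead of moving to a new port.
    return healthOk ? "reuse" : "wait";
  }
  // No gateway of ours: a busy port belongs to a foreign process.
  return portBusy ? "reassign" : "start";
}
```

With this split, a briefly failing health check on a live gateway yields "wait" rather than a port increment, and only a port held by an unrelated process leads to "reassign".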

This cascades on every WebUI restart, producing a chain like:

Port 8645 is in use, reassigning to 8646  (PID 748)
Port 8646 is in use, reassigning to 8647  (PID 738)
Port 8647 is in use, reassigning to 8648  (PID 760)
→ WebUI Node server EADDRINUSE on 8648 → port dead

Impact

  • Gateway config silently drifts from the standard 8642 to arbitrary ports up to 8648
  • WebUI becomes unreachable — systemctl status shows "running" but port 8648 is held by the gateway (Python) process, not the Node WebUI
  • config.yaml is corrupted by repeated yaml.load() → yaml.dump() cycles — comments, field ordering, and unmanaged keys are lost
  • npm upgrade does NOT fix it — the logic is in the build, and a fresh install has the same behavior

Affected Version

v0.5.16 (and presumably all versions shipping gateway-manager.js with resolvePort() → writeProfilePort())

Expected Behavior

For environments where the Hermes Agent Gateway is independently managed (systemd, docker, or manual gateway run), the GatewayManager should:

  1. Detect the running gateway's actual port (via PID + lsof / procfs)
  2. Use that port for upstream proxy — without modifying config.yaml
  3. Never write to ~/.hermes/config.yaml
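
Step 1 could be sketched roughly as follows, assuming the port is read from lsof's field output (for example `lsof -Pan -p <pid> -iTCP -sTCP:LISTEN -Fn`). The helper name parseListeningPorts is hypothetical, and the sample string is made up for illustration rather than captured from a real system.

```javascript
// Hypothetical helper: extract listening TCP ports from `lsof -F n` output.
// In field-output mode, lines starting with "n" carry the socket name,
// e.g. "n*:8642" or "n127.0.0.1:8642".
function parseListeningPorts(lsofFieldOutput) {
  const ports = new Set();
  for (const rawLine of lsofFieldOutput.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("n")) continue; // skip pid (p) and fd (f) fields
    // Take the port that ends the address (tolerates a trailing " (LISTEN)").
    const match = /:(\d+)(?:\s|$)/.exec(line);
    if (match) ports.add(Number(match[1]));
  }
  return [...ports].sort((a, b) => a - b);
}

// Illustrative input only.
const sample = "p748\nf12\nn*:8642\nf13\nn127.0.0.1:8642\n";
console.log(parseListeningPorts(sample)); // [ 8642 ]
```

The result would only feed the upstream proxy target in memory. Nothing here touches config.yaml.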

Suggested Fix

writeProfilePort() should be a no-op when the gateway is already running and responsive. Or better: make resolvePort() read-only — allocate ports in-memory only, without side effects on the filesystem.
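
A minimal sketch of the read-only variant, under the assumption that port availability can be probed by an injected function. The class InMemoryPortResolver, its method names, and the maxTries limit are hypothetical and do not exist in gateway-manager.js.

```javascript
// Hypothetical sketch of a side-effect-free resolver: nothing is written to disk.
class InMemoryPortResolver {
  constructor(isPortFree) {
    this.isPortFree = isPortFree; // injected probe: (port) => boolean
    this.assigned = new Map();    // profile name -> port, for this process only
  }

  resolve(name, configuredPort, maxTries = 20) {
    // A profile keeps the port it was given earlier in this process.
    if (this.assigned.has(name)) return this.assigned.get(name);
    const taken = new Set(this.assigned.values());
    for (let port = configuredPort; port < configuredPort + maxTries; port++) {
      if (!taken.has(port) && this.isPortFree(port)) {
        this.assigned.set(name, port);
        return port;
      }
    }
    throw new Error(`No free port in ${configuredPort}..${configuredPort + maxTries - 1}`);
  }
}
```

Because the mapping lives only in memory, every restart begins again from the port in config.yaml, so a transient clash cannot accumulate into a drift across restarts.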

The upgrade-safe workaround applied in our environment (via a startup wrapper script that re-applies the patch on each start):

# Insert early return at the top of writeProfilePort
sed -i '/^    writeProfilePort(name, port, host) {/a\        return;' dist/server/services/hermes/gateway-manager.js

Evidence

Full server.log trace of the spiral:

2026-05-10 12:27:46  PID 748  Port 8645 is in use, reassigning to 8646
2026-05-10 13:20:52  PID 738  Port 8646 is in use, reassigning to 8647
2026-05-10 17:38:22  PID 760  Port 8647 is in use, reassigning to 8648

After the last step, the EADDRINUSE error from the Node server:

{"err":{"type":"Error","message":"listen EADDRINUSE: address already in use 0.0.0.0:8648","port":8648},"msg":"Unhandled rejection"}
