fix: gateway status card shows 'not running' when no platforms connected#1757

Closed
skspade wants to merge 3 commits into nesquena:master from skspade:fix/gateway-status-card-no-platforms

Conversation

@skspade
Contributor

@skspade skspade commented May 6, 2026

fix: gateway status card shows 'not running' when no platforms connected

Thinking Path

  • Hermes WebUI aims for near 1:1 parity with what the gateway reports at runtime
  • The gateway status card in the settings panel tells users whether their gateway
    process is alive and which messaging platforms are connected
  • The running signal was bool(identity_map) — a truthiness check on connected
    platform sessions
  • An empty identity_map means zero platforms connected, not that the gateway is
    down, but bool({}) is False — so the UI showed "Gateway not running" when
    the gateway was actually alive and just had no active platform sessions
  • The fix sources the running signal from agent_health.build_agent_health_payload()
    which reads gateway.status metadata directly, then adds a configured tri-state
    so the frontend can distinguish "not running" from "not configured at all"
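
The truthiness pitfall described above is easy to reproduce in isolation. This is a minimal sketch that assumes identity_map is a plain dict; its real structure is not shown in this PR.

```python
# Illustrative only: the real shape of identity_map is an assumption here.
identity_map = {}  # gateway alive, but no platform sessions connected yet

# Old signal: truthiness of the session map.
running_old = bool(identity_map)

# An empty dict is falsy, so the card reported "not running" even though
# the gateway process itself could be perfectly healthy.
print(running_old)  # False
```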

What Changed

api/routes.py — handler for /api/gateway/status now calls
build_agent_health_payload() and reads the tri-state alive field (True /
False / None) instead of bool(identity_map). Falls back to the identity_map
heuristic when alive is None (gateway not configured — e.g. WebUI-only
deployments). Adds a new configured key to the JSON response.
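
The backend derivation described above could look roughly like the following. This is a sketch under stated assumptions, not the code in api/routes.py: the function name derive_gateway_status and its parameters are hypothetical, and only the alive tri-state and the running / configured keys come from the description.

```python
from typing import Optional


def derive_gateway_status(alive: Optional[bool], identity_map: dict) -> dict:
    """Map the tri-state `alive` value to `running` and `configured`.

    Hypothetical helper for illustration; per the PR, the real handler
    obtains `alive` from build_agent_health_payload().
    """
    if alive is None:
        # No gateway metadata: report "not configured" and fall back to
        # the old identity_map heuristic for `running`.
        return {"running": bool(identity_map), "configured": False}
    # alive is True or False: agent health is authoritative.
    return {"running": alive, "configured": True}


# The case from the bug report: gateway alive, zero platforms connected.
print(derive_gateway_status(True, {}))
```

Comparing with `is None` rather than relying on truthiness is the point of the sketch: False and None are both falsy, so a plain `if not alive` would merge the "down" and "not configured" states again.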

static/panels.js: loadGatewayStatus() checks r.configured first and
shows an amber "Gateway not configured" state when the gateway isn't set up.
The red "Gateway not running" state now appears only when r.configured is
true, that is, when the gateway is configured but its process is down.

tests/test_gateway_status_agent_health.py (new, 247 lines, 10 tests) —
Regression suite covering:

  • alive=True + empty sessions → running=true, configured=true
  • alive=False + empty sessions → running=false, configured=true
  • alive=None (not configured) → fallback to identity_map, configured=false
  • alive=True + populated sessions → platforms extracted correctly
  • alive=False + sessions present → still running=false (agent_health is authoritative)
  • Corrupted sessions.json → platforms=[], running derived from agent_health
  • Blank/missing platform fields → platforms=[]
  • Response always includes running and configured keys
  • last_active is empty when no sessions path exists
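
A few of the cases above can be sketched as plain test functions. This is not the contents of tests/test_gateway_status_agent_health.py: gateway_status is a hypothetical stand-in for the route handler, the health helper is injected as a callable instead of being monkeypatched, and "example-platform" is a placeholder key.

```python
from typing import Callable


def gateway_status(health_fn: Callable[[], dict], identity_map: dict) -> dict:
    """Hypothetical stand-in for the handler; health_fn replaces the real helper."""
    alive = health_fn().get("alive")
    if alive is None:
        # Not configured: fall back to the identity_map heuristic.
        return {"running": bool(identity_map), "configured": False}
    return {"running": alive, "configured": True}


def test_alive_true_with_empty_sessions():
    # The original bug: alive gateway, no platforms connected.
    result = gateway_status(lambda: {"alive": True}, {})
    assert result == {"running": True, "configured": True}


def test_alive_false_with_sessions_present():
    # Agent health is authoritative even when sessions exist.
    result = gateway_status(lambda: {"alive": False}, {"example-platform": "s1"})
    assert result == {"running": False, "configured": True}


def test_alive_none_falls_back_to_identity_map():
    result = gateway_status(lambda: {"alive": None}, {"example-platform": "s1"})
    assert result == {"running": True, "configured": False}
```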

Why It Matters

Users who deploy Hermes with the gateway process but haven't connected any
messaging platforms yet see a red "not running" indicator and may waste time
debugging a non-issue. Users who deploy WebUI without the gateway at all see
the same misleading state. The fix gives three distinct states: running (green),
not running (red, gateway configured but process down), and not configured
(amber, no gateway setup expected).
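
The three states can be summarised as a small decision function. The real logic lives in JavaScript in static/panels.js; the Python below is only a transliteration of the behaviour described here, and the name card_state is hypothetical.

```python
from typing import Tuple


def card_state(running: bool, configured: bool) -> Tuple[str, str]:
    """Return (colour, label) for the gateway status card.

    Illustrative sketch: `configured` is checked first, as the PR
    describes, so an unconfigured gateway is never shown as red.
    """
    if not configured:
        return ("amber", "Gateway not configured")
    if not running:
        return ("red", "Gateway not running")
    return ("green", "Running")


for running, configured in [(True, True), (False, True), (False, False)]:
    print(running, configured, card_state(running, configured))
```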

Verification

$ python -m pytest tests/test_gateway_status_agent_health.py -v --timeout=60
============================= 10 passed in 1.19s ==============================

Manual verification:

  • WebUI with gateway running, zero platforms connected → amber "not configured"
    or green (depending on gateway.status metadata), not red "not running"
  • WebUI with gateway running, platforms connected → green, platforms listed
  • WebUI with gateway process down → red "Gateway not running"
  • WebUI-only deployment (no gateway) → amber "Gateway not configured"

Risks / Follow-ups

  • Low risk. The alive=None fallback path preserves the existing
    bool(identity_map) behavior for deployments where agent_health module
    isn't available (e.g. WebUI-only), so this is strictly additive.
  • The configured key is new in the JSON response; external consumers of
    /api/gateway/status that don't expect it will ignore it silently.
  • No follow-ups required — the tri-state fully covers the states the gateway
    can be in.

Model Used

  • Provider: DeepSeek (custom)
  • Model: DeepSeek-V4-Pro
  • Assisted via Hermes Agent with terminal access for test execution

Use agent_health.build_agent_health_payload() as the authoritative
running signal instead of bool(identity_map). An empty identity_map
means zero connected messaging platforms, not that the gateway is down.

Falls back to identity_map heuristic when agent_health module is unavailable
(e.g. WebUI-only deployments).
@nesquena-hermes
Collaborator

Reading api/routes.py:3290-3322 on origin/master against the diff at HEAD, plus api/agent_health.py:66-127 to confirm the contract — the diagnosis is correct: an empty identity_map means "zero connected platforms," not "gateway down." Switching to build_agent_health_payload() as the authoritative signal is the right call and matches what panels.js:5414-5430 already does for the gateway card on the System tab.

A few observations after reading the agent-health helper end to end:

The except Exception fallback is effectively dead code

Looking at api/agent_health.py:73-83 and :88-97, build_agent_health_payload() is intentionally exception-safe — every internal failure (gateway module missing, runtime status read failing, get_running_pid raising) is caught and turned into an alive: None payload:

try:
    gateway_status = _gateway_status_module()
except Exception as exc:
    return {
        "alive": None,
        "checked_at": checked_at,
        "details": {"reason": "gateway_status_unavailable", ...},
    }

So in the new try/except in routes.py:3187-3206, the except branch will basically never fire in production — the function would have to fail before returning a dict, which it's designed not to do. The test test_gateway_status_handles_agent_health_unavailable_fallback_to_sessions only passes because it monkeypatches the helper to throw via a generator-throw trick; it isn't testing a real code path.
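
For readers unfamiliar with the "generator-throw trick": a lambda body must be an expression, so it cannot contain a raise statement, but calling .throw() on a generator raises from inside an expression. The snippet below shows the general Python idiom only; the actual test code is not reproduced in this thread, and failing_helper is a hypothetical name.

```python
# A one-line callable that always raises. The empty generator expression
# never yields; .throw() raises the given exception at its start and the
# exception propagates to the caller.
failing_helper = lambda: (_ for _ in ()).throw(RuntimeError("agent health unavailable"))

try:
    failing_helper()
except RuntimeError as exc:
    caught = str(exc)

print(caught)  # agent health unavailable
```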

This isn't blocking — defence in depth is fine — but a more honest version of "fall back to identity_map heuristic" is to key off alive is None, not exceptions:

health = build_agent_health_payload()
alive = health.get("alive")
if alive is True:
    running = True
elif alive is False:
    running = False
else:  # alive is None → gateway not configured / unavailable
    running = bool(identity_map)

That keeps the WebUI-only-with-stale-sessions case behaving the way the original heuristic did, and makes the tri-state semantics explicit.

Tri-state semantics worth surfacing

alive=None is documented at api/agent_health.py:69-71 as "no gateway metadata/status is available, so this WebUI setup is probably not configured with a separate gateway process." The current diff collapses that to running=False, which means a WebUI-only install will continue to render "Gateway not running" with a red dot in panels.js:5417. That matches the pre-fix behaviour but is arguably also wrong in the opposite direction — the gateway isn't running, but it's also not meant to be. Out of scope for this PR; calling it out so a follow-up issue can decide whether to hide the card entirely or label it "not configured" when alive is None.

One small test gap

The current AC4 covers "alive=true + sessions populated" and AC1 covers "alive=true + empty identity_map" (the actual bug). What's missing is the symmetric old-bug regression: pre-fix, running was bool(identity_map); post-fix, even if the gateway is freshly-started with alive=true and identity_map={}, last_active is still "" because of the if running and sessions_path.exists() guard at routes.py:3211. With the fix, running=True but last_active="" — worth one explicit assertion so a future refactor can't silently flip that back.
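
The guard in question can be reduced to a few lines to show what the suggested assertion would pin down. This is a sketch, not the handler code: read_last_active is a hypothetical helper, and whatever the real handler derives from the sessions file is replaced by a placeholder string.

```python
import tempfile
from pathlib import Path


def read_last_active(running: bool, sessions_path: Path) -> str:
    """Hypothetical reduction of the `if running and sessions_path.exists()` guard."""
    last_active = ""
    if running and sessions_path.exists():
        # Placeholder: the real handler would derive a value from the file.
        last_active = "placeholder-timestamp"
    return last_active


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "sessions.json"  # deliberately never created
    # Gateway alive (running=True) but no sessions file: last_active stays empty.
    result = read_last_active(True, missing)
    assert result == ""
```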

Verdict

Logic is correct, tests cover the bug, the change is small and surgical. The two suggestions above are polish, not blockers. If you'd rather land as-is and address the alive is None case in a follow-up, that's also reasonable.

…unning

- Backend: return `configured` field alongside `running`. When
  alive=None (no gateway metadata), configured=false with fallback to
  identity_map heuristic.
- Frontend: amber "Gateway not configured" when configured=false,
  red "Gateway not running" only when configured but process is down,
  green "Running" when both true.
- Replace dead try/except fallback with explicit tri-state check on
  health["alive"].
- Add regression test for last_active guard when alive=true and
  identity_map is empty.

All 87 gateway-related tests pass.
@skspade
Contributor Author

skspade commented May 6, 2026

Addressed all three points in 4aca1d2:

  1. Tri-state handling: Backend now returns configured alongside running. alive=None → configured=false, and the frontend renders an amber "Gateway not configured" badge for that state rather than a misleading red "not running." The configured=false path still falls back to bool(identity_map) for running to preserve the WebUI-only-with-stale-sessions behavior you noted.

  2. Dead except Exception removed: Replaced with the explicit tri-state branch on health["alive"] you suggested — alive is True / alive is False / alive is None → fallback. Much clearer.

  3. Regression test gap: Added test_gateway_status_last_active_empty_when_alive_and_no_sessions_path asserting last_active == "" when alive=true and identity_map={}.

All 87 gateway tests pass. The "what to do when alive is None in the UI" question (hide card entirely vs. amber badge) can definitely be a follow-up — went with amber for now since it gives the user actionable info without being alarmist.

nesquena-hermes added a commit that referenced this pull request May 6, 2026
v0.51.14 — 4-PR contributor batch (#1756, #1757, #1760, #1761)
@nesquena-hermes
Collaborator

Thanks @skspade — this shipped in v0.51.14 (commit 2106083) as part of a 4-PR full-sweep contributor batch. Stage rebased your branch onto current master, ran the full pre-release gate (4649 pytest, browser tests, Opus advisor verdict SHIP all 4), and merged via release PR #1763.

Your PR's CHANGELOG entry was a redundant addition since stage writes the v0.51.14 changelog at stamp time; the conflict on CHANGELOG was auto-resolved by dropping that one commit. The actual code fix shipped intact.

GitHub didn't auto-close because the merge commit only references the squash-merged stage branch, not your fork's commit directly — closing manually for hygiene.

Live now on existing installs after git pull + restart.

Release notes: https://github.com/nesquena/hermes-webui/releases/tag/v0.51.14

iosub pushed a commit to iosub/HERMES-hermes-webui2 that referenced this pull request May 6, 2026
nesquena#1757, nesquena#1760, nesquena#1761)

Constituent PRs:
- nesquena#1760 (@ai-ag2026) preserve pending user turn on stream errors. Closes nesquena#1361.
- nesquena#1761 (@dso2ng) scope terminal stream cleanup to owner session. Refs nesquena#1694.
  AUTO-FIX applied: restored !INFLIGHT[S.session.session_id] disjunct in
  _setActivePaneIdleIfOwner (regression introduced by helper centralization).
- nesquena#1756 (@ng-technology-llc) isolate profile cookie per webui instance. Closes nesquena#803.
- nesquena#1757 (@skspade) tri-state gateway status (alive: True/False/None).

Tests: 4642 → 4662 collected (+20). 4649 passed, 9 skipped (test-isolation
prong-2 noise), 3 xpassed, 0 failed in 152s.

Pre-release verification:
- All 4 PRs CI-green or rebased clean (nesquena#1757 had stale base; CHANGELOG conflict
  auto-resolved by dropping the PR's redundant entry).
- node -c clean on static/messages.js + static/panels.js.
- 11/11 browser API endpoints PASS.
- Pre-stamp re-fetch: all PR heads match local rebases.
- Opus advisor: SHIP, all 5 verification questions clean, 0 MUST-FIX, 0 SHOULD-FIX.
- Two NICE-TO-HAVE coverage gaps absorbed in-release:
  (1) test_sprint36.py asserts !INFLIGHT[...] disjunct in helper body
  (2) test_issue1361_cancel_data_loss.py adds structural-grep test to pin
      _materialize_pending_user_turn_before_error call sites at error branches.

Closes nesquena#803, nesquena#1361, nesquena#1694.