fix: add backend startup heartbeat liveness probe #114

Merged
zouyonghe merged 13 commits into AstrBotDevs:main from zouyonghe:fix/backend-startup-heartbeat-liveness
Apr 10, 2026

Conversation

@zouyonghe
Member

@zouyonghe zouyonghe commented Apr 10, 2026

Summary

The desktop shell currently waits for backend readiness by polling /api/stat/start-time until a fixed overall timeout expires. That breaks down when the packaged backend spends a long time installing dependencies before the HTTP server starts listening: the process is still doing useful startup work, but the shell sees only repeated HTTP failures and eventually kills it as if startup had hung.

This change adds a launcher-owned startup heartbeat so the shell can distinguish between "not ready yet" and "no longer making progress". The Python launcher now refreshes a heartbeat file while startup is still in progress, and the Rust readiness loop treats a fresh heartbeat from the current child PID as evidence that startup is alive even when HTTP is not ready yet. If the heartbeat goes stale before readiness arrives, startup still fails fast. The packaged-mode hard timeout is also raised from 5 minutes to 15 minutes so long-running dependency installation has more headroom, while an idle-timeout guard prevents endless waits.

The heartbeat is scoped to the current child PID and requires state == "starting" to avoid stale files from previous runs or other instances being treated as current progress. The new timeout configuration and heartbeat path are documented for operators.
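The scoping rules described above can be sketched in Python (the real check lives in the Rust readiness loop; the function and field handling here are illustrative, based on the pid/state/updated_at_ms file format shown in the review diffs):

```python
import json
import time
from pathlib import Path


def read_fresh_heartbeat_age_ms(path, expected_pid, max_age_ms):
    """Return the heartbeat age in ms if the file belongs to the current
    child and is still fresh; return None for missing, foreign, stale, or
    malformed heartbeats (including timestamps from the future)."""
    try:
        heartbeat = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    if heartbeat.get("pid") != expected_pid or heartbeat.get("state") != "starting":
        return None
    updated_at_ms = heartbeat.get("updated_at_ms")
    if not isinstance(updated_at_ms, int):
        return None
    age_ms = int(time.time() * 1000) - updated_at_ms
    # A negative age means clock skew or a corrupted file; treat it as
    # invalid rather than "freshest possible".
    if age_ms < 0 or age_ms > max_age_ms:
        return None
    return age_ms
```

Rejecting future timestamps mirrors the reviewers' concern below about `duration_since` errors being coerced to zero age.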

Testing

  • cargo test backend::
  • python3 -m py_compile scripts/backend/templates/launch_backend.py

Summary by Sourcery

Introduce a backend startup heartbeat mechanism so the desktop can distinguish active startup from hung startup while waiting for HTTP readiness.

New Features:

  • Add a launcher-driven backend startup heartbeat file that tracks the current backend PID and startup state to inform readiness checks.

Bug Fixes:

  • Prevent the desktop from prematurely killing packaged backends that are still actively starting up but have not yet exposed their HTTP readiness endpoint.

Enhancements:

  • Extend backend startup timeout handling with a configurable idle-timeout that fails fast when the heartbeat stops progressing before readiness.
  • Plumb startup heartbeat paths through launch plans and backend runtime configuration, and standardize use of the ASTRBOT_ROOT environment variable via a shared constant.
  • Improve readiness timeout logging with heartbeat age information to aid diagnosing slow or stuck startups.

Documentation:

  • Document new startup heartbeat and timeout environment variables, including defaults and ranges, for operators.

Tests:

  • Add Rust unit tests for startup heartbeat parsing, freshness evaluation, and failure modes during readiness polling.
  • Add Python unit tests for the backend launcher heartbeat writer, including atomic file replacement, warning behavior, and exit handling.
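The documented timeout variables are described as carrying defaults and valid ranges; one minimal way such a value might be resolved is sketched below (the function name and clamping policy are illustrative assumptions, not the PR's actual API):

```python
import os


def resolve_timeout_ms(env_name, default_ms, min_ms, max_ms):
    """Read a timeout override from the environment, falling back to the
    default for unset or unparsable values and clamping to a sane range."""
    raw = os.environ.get(env_name, "").strip()
    if not raw:
        return default_ms
    try:
        value = int(raw)
    except ValueError:
        return default_ms
    return max(min_ms, min(max_ms, value))
```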


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="206-214" />
<code_context>
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
+fn startup_heartbeat_timestamp_is_fresh(
+    updated_at: Option<SystemTime>,
+    now: SystemTime,
+    max_age: Duration,
+) -> bool {
+    updated_at
+        .map(|updated_at| now.duration_since(updated_at).unwrap_or(Duration::ZERO))
+        .is_some_and(|age| age <= max_age)
+}
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Avoid treating future heartbeat timestamps as implicitly fresh.

Here, a future heartbeat timestamp (due to clock skew or a malformed file) makes `duration_since` return an error, which you coerce to `Duration::ZERO` and therefore treat as freshest. That inverts the intended semantics. Consider treating the `Err` case as stale/invalid (e.g., returning `false` when `duration_since` fails or skipping the heartbeat) instead of mapping it to zero.

```suggestion
fn startup_heartbeat_timestamp_is_fresh(
    updated_at: Option<SystemTime>,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    updated_at
        .and_then(|updated_at| now.duration_since(updated_at).ok())
        .is_some_and(|age| age <= max_age)
}
```
</issue_to_address>

### Comment 2
<location path="src-tauri/src/backend/readiness.rs" line_range="85-92" />
<code_context>
+            };
+
+            if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
+                if let Some(updated_at) =
+                    read_startup_heartbeat_updated_at(heartbeat_path, child_pid)
+                {
+                    if last_startup_heartbeat_at
+                        .map(|last_seen| updated_at > last_seen)
+                        .unwrap_or(true)
+                    {
+                        last_startup_heartbeat_at = Some(updated_at);
+                    }
+                }
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Consider clearing the last-seen heartbeat when the file no longer matches the current process.

If `read_startup_heartbeat_updated_at` returns `None` (PID mismatch or non-`starting` state), `last_startup_heartbeat_at` keeps its previous value. That allows an old heartbeat from a previous process to be treated as stale for the current process and trigger an idle-timeout error. To avoid this, consider setting `last_startup_heartbeat_at` to `None` whenever the read returns `None`, so staleness is only checked against heartbeats from the current PID.
</issue_to_address>
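The behavior this comment suggests can be expressed as a tiny pure function (a sketch only; the real state lives in the Rust readiness loop):

```python
def next_last_seen(last_seen, current_read):
    """Track staleness only against heartbeats from the current PID.
    A read of None (PID mismatch, wrong state, or missing file) clears the
    remembered timestamp so an old process's heartbeat cannot trip the
    idle timeout for the new process."""
    if current_read is None:
        return None
    if last_seen is None or current_read > last_seen:
        return current_read
    return last_seen
```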

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@zouyonghe
Member Author

@sourcery-ai review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a heartbeat mechanism to improve the reliability of the backend startup monitoring process. The backend now periodically writes a JSON heartbeat file containing its PID and status, while the Tauri frontend monitors this file during the readiness check to distinguish between a slow-starting process and a hung one. Changes include new environment variables for heartbeat paths and idle timeouts, a Python implementation of the heartbeat loop in the launcher script, and updated Rust logic to validate heartbeat freshness during the HTTP readiness probe. I have no feedback to provide as there are no review comments to assess.


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The startup heartbeat path resolution is duplicated in both runtime::backend_readiness_config and BackendState::spawn_backend; consider computing the path once and reusing it for both the env var and the readiness config to avoid divergence if the resolution rules change later.
  • In StartupHeartbeatFile, using a String for state makes it easy for typos or unexpected values to slip through; consider switching this to an enum with serde tags (e.g. Starting/Stopping) so only valid states are accepted at deserialization time.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The startup heartbeat path resolution is duplicated in both `runtime::backend_readiness_config` and `BackendState::spawn_backend`; consider computing the path once and reusing it for both the env var and the readiness config to avoid divergence if the resolution rules change later.
- In `StartupHeartbeatFile`, using a `String` for `state` makes it easy for typos or unexpected values to slip through; consider switching this to an enum with `serde` tags (e.g. `Starting`/`Stopping`) so only valid states are accepted at deserialization time.


@zouyonghe
Member Author

@sourcery-ai review


@SourceryAI SourceryAI left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • Both read_startup_heartbeat_updated_at and the Python write_startup_heartbeat swallow all errors silently, which could make misconfiguration or file permission issues very hard to diagnose; consider logging at least a single warning (throttled) when JSON parsing or file writes repeatedly fail.
  • In startup_heartbeat_timestamp_is_fresh you recompute SystemTime::now() at each call site (and separately when logging timeout details); if you care about precise age calculations and consistent diagnostics, consider threading a single now value through the readiness loop and into the log helper instead of calling SystemTime::now() multiple times.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Both `read_startup_heartbeat_updated_at` and the Python `write_startup_heartbeat` swallow all errors silently, which could make misconfiguration or file permission issues very hard to diagnose; consider logging at least a single warning (throttled) when JSON parsing or file writes repeatedly fail.
- In `startup_heartbeat_timestamp_is_fresh` you recompute `SystemTime::now()` at each call site (and separately when logging timeout details); if you care about precise age calculations and consistent diagnostics, consider threading a single `now` value through the readiness loop and into the log helper instead of calling `SystemTime::now()` multiple times.

## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="198-204" />
<code_context>
+    Stopping,
+}
+
+fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
+    let payload = fs::read_to_string(path).ok()?;
+    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
+    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
+        return None;
+    }
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against overflow when converting `updated_at_ms` into `SystemTime`.

On some platforms, `UNIX_EPOCH + Duration::from_millis(...)` will panic if `updated_at_ms` is large enough that the resulting `SystemTime` is out of range (e.g., due to corruption or manual edits). Using `UNIX_EPOCH.checked_add(Duration::from_millis(...))` and returning `None` on overflow would avoid a crash and treat a bad heartbeat as "no heartbeat" instead.
</issue_to_address>



@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • In the startup readiness loop, next_startup_heartbeat_at clears the previous timestamp whenever the current read returns None, which means a disappeared/invalid heartbeat file won't trigger the idle-timeout fast-fail and will instead only be caught by the global timeout—consider treating a transition from Some to None as stale (or at least logging it) to surface regressions in the heartbeat writer more clearly.
  • The Python write_startup_heartbeat helper fully swallows all exceptions, including on the initial write; since the Rust side silently ignores missing/invalid heartbeats, a one-time warning to stderr or a debug log when the first write fails would make diagnosing misconfigured ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH or permission issues much easier without impacting normal operation.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the startup readiness loop, `next_startup_heartbeat_at` clears the previous timestamp whenever the current read returns `None`, which means a disappeared/invalid heartbeat file won't trigger the idle-timeout fast-fail and will instead only be caught by the global timeout—consider treating a transition from `Some` to `None` as stale (or at least logging it) to surface regressions in the heartbeat writer more clearly.
- The Python `write_startup_heartbeat` helper fully swallows all exceptions, including on the initial write; since the Rust side silently ignores missing/invalid heartbeats, a one-time warning to stderr or a debug log when the first write fails would make diagnosing misconfigured `ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH` or permission issues much easier without impacting normal operation.

## Individual Comments

### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="129-138" />
<code_context>
+    return Path(raw)
+
+
+def write_startup_heartbeat(path: Path, state: str) -> None:
+    try:
+        path.parent.mkdir(parents=True, exist_ok=True)
+        payload = {
+            "pid": os.getpid(),
+            "state": state,
+            "updated_at_ms": int(time.time() * 1000),
+        }
+        temp_path = path.with_name(f"{path.name}.tmp")
+        temp_path.write_text(
+            json.dumps(payload, separators=(",", ":")),
+            encoding="utf-8",
+        )
+        temp_path.replace(path)
+    except Exception:
+        return
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Swallowing all exceptions on heartbeat writes can make diagnosing misconfiguration difficult.

This unconditional `except Exception` prevents visibility into issues like permission errors, invalid paths, or serialization failures. Please log at least a brief message (e.g., to stderr, optionally behind an env flag or limited to expected exception types) so that heartbeat write failures are diagnosable when readiness based on this file doesn’t work.

Suggested implementation:

```python
def resolve_startup_heartbeat_path() -> Path | None:
    raw = os.environ.get(STARTUP_HEARTBEAT_ENV, "").strip()
    if not raw:
        return None
    return Path(raw)


def write_startup_heartbeat(path: Path, state: str) -> None:
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = {
            "pid": os.getpid(),
            "state": state,
            "updated_at_ms": int(time.time() * 1000),
        }
        temp_path = path.with_name(f"{path.name}.tmp")
        temp_path.write_text(
            json.dumps(payload, separators=(",", ":")),
            encoding="utf-8",
        )
        temp_path.replace(path)
    except Exception as exc:
        if os.environ.get("STARTUP_HEARTBEAT_DEBUG"):
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path!s}: {exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return


import atexit

```

If you prefer logging only specific, expected exceptions (e.g., `OSError`, `json.JSONDecodeError`) instead of all exceptions, you can narrow the `except` clause accordingly. Also, if you want the debug env var name to be a constant alongside `STARTUP_HEARTBEAT_ENV`, introduce something like `STARTUP_HEARTBEAT_DEBUG_ENV = "STARTUP_HEARTBEAT_DEBUG"` near its definition and replace the hard-coded string in `os.environ.get(...)` with that constant for consistency.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In the readiness loop you call SystemTime::now() multiple times per iteration (for heartbeat freshness and for timeout logging); consider capturing now once at the top of the loop and passing it through to keep timing consistent and avoid redundant syscalls.
  • For StartupHeartbeatFile you might want to add #[serde(deny_unknown_fields)] to fail fast on malformed heartbeat files with unexpected fields rather than silently ignoring potential format drift or operator mistakes.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the readiness loop you call `SystemTime::now()` multiple times per iteration (for heartbeat freshness and for timeout logging); consider capturing `now` once at the top of the loop and passing it through to keep timing consistent and avoid redundant syscalls.
- For `StartupHeartbeatFile` you might want to add `#[serde(deny_unknown_fields)]` to fail fast on malformed heartbeat files with unexpected fields rather than silently ignoring potential format drift or operator mistakes.

## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="226-232" />
<code_context>
+    Invalidated(SystemTime),
+}
+
+fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
+    let payload = fs::read_to_string(path).ok()?;
+    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
+    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
+        return None;
+    }
+    Some(UNIX_EPOCH + Duration::from_millis(heartbeat.updated_at_ms))
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against potential overflow when converting `updated_at_ms` into a `SystemTime`.

`updated_at_ms` is a `u64` from an external file and is added to `UNIX_EPOCH` via `Duration::from_millis`. For very large values (e.g., near `u64::MAX`), `UNIX_EPOCH + Duration` can panic. Since this function already returns `Option`, consider using `SystemTime::UNIX_EPOCH.checked_add(Duration::from_millis(heartbeat.updated_at_ms))?` or clamping to a reasonable maximum, so malformed/adversarial heartbeat files don’t cause a panic.
</issue_to_address>

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+        return False
+
+
+def start_startup_heartbeat() -> None:
+    heartbeat_path = resolve_startup_heartbeat_path()
+    if heartbeat_path is None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `start_startup_heartbeat` to use a small helper class that encapsulates the shared state and related behavior instead of nested functions with `nonlocal`.

You can reduce the inline complexity in `start_startup_heartbeat` by encapsulating the shared state (`warning_emitted`, `stop_event`, `heartbeat_path`) and behavior into a small helper class instead of nested functions with `nonlocal`.

This keeps all behavior the same (including the first write, warning semantics, atexit behavior, and periodic loop), but makes the control flow easier to follow and test.

```python
class StartupHeartbeat:
    def __init__(self, path: Path, interval: float) -> None:
        self._path = path
        self._interval = interval
        self._stop_event = threading.Event()
        self._warning_emitted = False

    def _write(self, state: str, *, warn_on_error: bool) -> bool:
        ok = write_startup_heartbeat(
            self._path,
            state,
            warn_on_error=warn_on_error,
        )
        if not ok:
            self._warning_emitted = True
        return ok

    def start(self) -> None:
        # initial write: always warn on error
        self._write("starting", warn_on_error=True)
        atexit.register(self.stop)
        threading.Thread(
            target=self._loop,
            name="astrbot-startup-heartbeat",
            daemon=True,
        ).start()

    def stop(self) -> None:
        self._stop_event.set()
        # respect prior failure: only warn if we haven't already
        self._write("stopping", warn_on_error=not self._warning_emitted)

    def _loop(self) -> None:
        while not self._stop_event.wait(self._interval):
            self._write("starting", warn_on_error=not self._warning_emitted)
```

`start_startup_heartbeat` then becomes a thin, declarative wrapper:

```python
def start_startup_heartbeat() -> None:
    heartbeat_path = resolve_startup_heartbeat_path()
    if heartbeat_path is None:
        return

    StartupHeartbeat(
        heartbeat_path,
        STARTUP_HEARTBEAT_INTERVAL_SECONDS,
    ).start()
```

This removes nested closures and `nonlocal`, keeps lifetime management in one cohesive unit, and makes the module-level startup sequence (`configure_*`, `preload_*`, `start_startup_heartbeat`) straightforward to skim.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="155-164" />
<code_context>
+class StartupHeartbeat:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Reconsider when to suppress warnings on heartbeat write failures

Because `_warning_emitted` is set to `True` on the first write failure, all later writes (including the final `"stopping"` write in `stop`) use `warn_on_error=False`. A transient early failure can therefore silence more critical stop‑time failures. If you’re trying to avoid log spam, consider only suppressing repeated failures after at least one successful write (or rate‑limiting logs), while still ensuring a warning is emitted if the stop‑time write fails.

Suggested implementation:

```python
class StartupHeartbeat:
    def __init__(self, path: Path, interval_seconds: float) -> None:
        self._path = path
        self._interval_seconds = interval_seconds
        self._stop_event = threading.Event()
        # Tracks whether we have ever successfully written a heartbeat
        self._had_successful_write = False
        # Tracks whether we've already emitted a warning for the *current* failure sequence
        self._warning_emitted = False

    def _write(self, state: str, *, warn_on_error: bool) -> bool:
        # Only suppress warnings for repeated failures that occur *after* at least one
        # successful write. Before any success, every failure (including stop-time)
        # will still warn.
        effective_warn_on_error = warn_on_error and (
            not self._warning_emitted or not self._had_successful_write
        )

        ok = write_startup_heartbeat(
            self._path,
            state,
            warn_on_error=effective_warn_on_error,
        )

        if ok:
            # Any success resets the failure sequence and re-enables warning for
            # the next failure.
            self._had_successful_write = True
            self._warning_emitted = False
        elif effective_warn_on_error:
            # Mark that we've already warned for this failure sequence so
            # subsequent failures (without an intervening success) can be
            # suppressed to avoid log spam.
            self._warning_emitted = True

        return ok

```

These changes assume the previous implementation of `_write` matched the `SEARCH` block exactly. If it differs in your repo, adapt the `SEARCH` section to fully cover the old `_write` definition so it gets replaced cleanly.

Also, this implementation preserves the existing call sites’ semantics: callers still pass `warn_on_error` as before. The new logic ensures:
- Before any successful write, every failure (including the stop-time write) will log a warning.
- After at least one successful write, the first failure will log a warning, and subsequent failures in the same failure sequence will be suppressed until the next success.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="85" />
<code_context>
                 }
+            };
+
+            if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
+                match next_startup_heartbeat_at(
+                    last_startup_heartbeat_at,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting the startup heartbeat handling into a dedicated stateful helper function so the main readiness loop only coordinates high-level steps instead of inlining the state machine logic.

You can shrink the cognitive load of the main loop by pushing the heartbeat state machine into a dedicated helper and hiding `StartupHeartbeatObservation` and the timestamp freshness logic behind it. That keeps all behavior but reduces branching in the loop and makes the heartbeat logic more locally testable.

For example, introduce a small state struct and helper:

```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatState {
    last_seen_at: Option<SystemTime>,
    logged_fresh: bool,
}

impl StartupHeartbeatState {
    fn new() -> Self {
        Self {
            last_seen_at: None,
            logged_fresh: false,
        }
    }
}

enum StartupHeartbeatStep {
    Continue(StartupHeartbeatState),
    Failed(String),
}
```

Then encapsulate all the heartbeat-related logic, including `next_startup_heartbeat_at` and `startup_heartbeat_timestamp_is_fresh`, into a single function:

```rust
fn step_startup_heartbeat(
    heartbeat_path: &Path,
    child_pid: u32,
    now: SystemTime,
    idle_timeout: Duration,
    mut state: StartupHeartbeatState,
) -> StartupHeartbeatStep {
    match next_startup_heartbeat_at(
        state.last_seen_at,
        read_startup_heartbeat_updated_at(heartbeat_path, child_pid),
    ) {
        StartupHeartbeatObservation::Missing => {
            state.last_seen_at = None;
            StartupHeartbeatStep::Continue(state)
        }
        StartupHeartbeatObservation::Observed(updated_at) => {
            state.last_seen_at = Some(updated_at);
            if startup_heartbeat_timestamp_is_fresh(state.last_seen_at, now, idle_timeout) {
                if !state.logged_fresh {
                    append_desktop_log(
                        "backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
                    );
                    state.logged_fresh = true;
                }
                StartupHeartbeatStep::Continue(state)
            } else {
                append_desktop_log(
                    "backend startup heartbeat went stale before HTTP dashboard became ready",
                );
                StartupHeartbeatStep::Failed(format!(
                    "Backend startup heartbeat went stale after {}ms without HTTP readiness.",
                    idle_timeout.as_millis()
                ))
            }
        }
        StartupHeartbeatObservation::Invalidated(previous) => {
            let heartbeat_age_ms = now
                .duration_since(previous)
                .ok()
                .map(|age| age.as_millis().to_string())
                .unwrap_or_else(|| "unknown".to_string());
            append_desktop_log(&format!(
                "backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}",
            ));
            StartupHeartbeatStep::Failed(
                "Backend startup heartbeat disappeared or became invalid before HTTP readiness."
                    .to_string(),
            )
        }
    }
}
```

Now the main loop only coordinates, instead of open-coding the state machine:

```rust
let mut heartbeat_state = StartupHeartbeatState::new();

loop {
    let (http_status, tcp_reachable) =
        self.probe_backend_readiness(&readiness.path, readiness.probe_timeout_ms);
    if matches!(http_status, Some(status_code) if (200..400).contains(&status_code)) {
        return Ok(());
    }

    let now = SystemTime::now();
    let child_pid = /* existing child lock / try_wait logic */;

    if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
        match step_startup_heartbeat(
            heartbeat_path,
            child_pid,
            now,
            startup_idle_timeout,
            heartbeat_state,
        ) {
            StartupHeartbeatStep::Continue(next_state) => {
                heartbeat_state = next_state;
            }
            StartupHeartbeatStep::Failed(reason) => {
                return Err(reason);
            }
        }
    }

    // existing tcp_reachable + timeout + sleep logic
}
```

This keeps all functionality (including use of `StartupHeartbeatObservation`, `startup_heartbeat_timestamp_is_fresh`, and timeout snapshotting) but:

- centralizes heartbeat behavior and logging in one place  
- removes multiple flags (`startup_heartbeat_logged`, `last_startup_heartbeat_at`) from the loop body  
- makes it easier to unit test the heartbeat behavior with different sequences of observations.
</issue_to_address>

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+        return False
+
+
+class StartupHeartbeat:
+    def __init__(self, path: Path, interval_seconds: float) -> None:
+        self._path = path
</code_context>
<issue_to_address>
**issue (complexity):** Consider replacing the StartupHeartbeat class with a simple heartbeat loop function and local flags to make the lifecycle and warning logic more explicit and easier to follow.

The `StartupHeartbeat` class adds indirection and scattered state for behavior that can be expressed with a single loop function and a couple of local flags, while keeping all semantics (first-failure warnings, reset after success, always warn on stop, fire‑and‑forget thread).

You can collapse the class into a functional pattern and keep the warning behavior explicit:

```python
def heartbeat_loop(path: Path, interval: float, stop_event: threading.Event) -> None:
    had_success = False
    warned_on_failure = False

    # initial write
    ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
    if ok:
        had_success = True
        warned_on_failure = False
    else:
        warned_on_failure = True

    while not stop_event.wait(interval):
        # warn on first failure, or before any success
        warn_now = (not warned_on_failure) or (not had_success)
        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if ok:
            had_success = True
            warned_on_failure = False
        elif warn_now:
            warned_on_failure = True
```

Then wire it up in `start_startup_heartbeat` without the class:

```python
def start_startup_heartbeat() -> None:
    heartbeat_path = resolve_startup_heartbeat_path()
    if heartbeat_path is None:
        return

    stop_event = threading.Event()

    def on_exit() -> None:
        stop_event.set()
        # always warn on stop failure
        write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)

    atexit.register(on_exit)

    threading.Thread(
        target=heartbeat_loop,
        args=(heartbeat_path, STARTUP_HEARTBEAT_INTERVAL_SECONDS, stop_event),
        name="astrbot-startup-heartbeat",
        daemon=True,
    ).start()
```

This:

- Removes the `StartupHeartbeat` object and private state fields.
- Keeps all existing behavior (initial write, periodic writes, warning throttling, final “stopping” write with warnings).
- Makes the lifecycle and warning logic visible in one place, reducing cognitive load and indirection while preserving the new feature.
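As a quick sanity check, the warning-throttling rule above can be simulated against a scripted sequence of write outcomes (a standalone sketch; the real `write_startup_heartbeat` is replaced by plain booleans):

```python
def simulate_warning_iterations(results: list[bool]) -> list[int]:
    """Return the indices at which a warning would be emitted, given a
    sequence of heartbeat write outcomes (True = write succeeded)."""
    had_success = False
    warned_on_failure = False
    warned: list[int] = []
    for index, ok in enumerate(results):
        # Warn on the first failure of a streak, or on every failure
        # before the first successful write.
        warn_now = (not warned_on_failure) or (not had_success)
        if ok:
            had_success = True
            warned_on_failure = False
        else:
            if warn_now:
                warned.append(index)
            warned_on_failure = True
    return warned
```

Each failure before any success warns; after a success, only the first failure of each new streak warns.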
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue


## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="176" />
<code_context>
     }
 }
+
+#[derive(serde::Deserialize)]
+#[serde(deny_unknown_fields)]
+struct StartupHeartbeatFile {
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the new heartbeat and timeout logic by collapsing helper types, letting the tracker be mutated directly, and extracting child PID handling so the main readiness loop reads more linearly.

You can reduce the new complexity without changing behavior by collapsing some of the extra types/indirection and making the loop more linear.

### 1. Simplify `step_startup_heartbeat` API and remove `StartupHeartbeatStep`

`StartupHeartbeatStep` is a bespoke `Result` and the tracker is small. Let `step_startup_heartbeat` mutate the tracker via `&mut` and return `Result<(), String>`:

```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatTracker {
    last_seen_at: Option<SystemTime>,
    logged_fresh: bool,
}

impl StartupHeartbeatTracker {
    fn new() -> Self {
        Self {
            last_seen_at: None,
            logged_fresh: false,
        }
    }
}

fn step_startup_heartbeat(
    heartbeat_path: &Path,
    child_pid: u32,
    now: SystemTime,
    idle_timeout: Duration,
    state: &mut StartupHeartbeatTracker,
) -> Result<(), String> {
    let prev = state.last_seen_at;
    let current = read_startup_heartbeat_updated_at(heartbeat_path, child_pid);

    match (prev, current) {
        (Some(prev), None) => {
            let heartbeat_age_ms = now
                .duration_since(prev)
                .ok()
                .map(|age| age.as_millis().to_string())
                .unwrap_or_else(|| "unknown".to_string());
            append_desktop_log(&format!(
                "backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}",
            ));
            Err("Backend startup heartbeat disappeared or became invalid before HTTP readiness.".to_string())
        }
        (None, None) => {
            state.last_seen_at = None;
            Ok(())
        }
        (_, Some(current)) => {
            let updated_at = match prev {
                Some(prev) if current <= prev => prev,
                _ => current,
            };
            state.last_seen_at = Some(updated_at);

            if startup_heartbeat_timestamp_is_fresh(state.last_seen_at, now, idle_timeout) {
                if !state.logged_fresh {
                    append_desktop_log(
                        "backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
                    );
                    state.logged_fresh = true;
                }
                Ok(())
            } else {
                append_desktop_log(
                    "backend startup heartbeat went stale before HTTP dashboard became ready",
                );
                Err(format!(
                    "Backend startup heartbeat went stale after {}ms without HTTP readiness.",
                    idle_timeout.as_millis()
                ))
            }
        }
    }
}
```

Then the call site becomes simpler and more idiomatic:

```rust
if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
    step_startup_heartbeat(
        heartbeat_path,
        child_pid,
        now,
        startup_idle_timeout,
        &mut startup_heartbeat_state,
    )?;
}
```

This removes `StartupHeartbeatStep` and `StartupHeartbeatObservation` while keeping all behavior.

### 2. Inline `heartbeat_updated_at_ms_to_system_time`

This helper is only used once and just forwards to `checked_add`. You can inline it into `read_startup_heartbeat_updated_at` and drop the function:

```rust
fn read_startup_heartbeat_updated_at(path: &Path, expected_pid: u32) -> Option<SystemTime> {
    let payload = fs::read_to_string(path).ok()?;
    let heartbeat: StartupHeartbeatFile = serde_json::from_str(&payload).ok()?;
    if heartbeat.pid != expected_pid || heartbeat.state != StartupHeartbeatState::Starting {
        return None;
    }
    UNIX_EPOCH.checked_add(Duration::from_millis(heartbeat.updated_at_ms))
}
```

Tests for `heartbeat_updated_at_ms_to_system_time` can be rewritten to assert against this behavior directly if still needed.
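The accept/reject rules (matching PID, `state == "starting"`, parseable JSON) are also easy to prototype when reasoning about edge cases; this mirrors, rather than reproduces, the Rust helper:

```python
import json
from pathlib import Path

def read_startup_heartbeat_updated_at_ms(path: Path, expected_pid: int):
    """Return the heartbeat's updated_at_ms only when the file is valid,
    belongs to the expected child PID, and is still in the 'starting' state."""
    try:
        heartbeat = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # missing or unparseable file counts as "no heartbeat"
    if heartbeat.get("pid") != expected_pid:
        return None  # stale file from a previous run or another instance
    if heartbeat.get("state") != "starting":
        return None
    return heartbeat.get("updated_at_ms")
```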

### 3. Remove `ReadinessTimeoutSnapshot`

`ReadinessTimeoutSnapshot` is just a tuple wrapper used in one place. You can pass the fields directly and compute the heartbeat age inside `log_backend_readiness_timeout`:

```rust
fn log_backend_readiness_timeout(
    &self,
    timeout: Duration,
    ready_http_path: &str,
    probe_timeout_ms: u64,
    now: SystemTime,
    last_http_status: Option<u16>,
    tcp_reachable: bool,
    last_startup_heartbeat_at: Option<SystemTime>,
) {
    let last_http_status_text = last_http_status
        .map(|status| status.to_string())
        .unwrap_or_else(|| "none".to_string());
    let startup_heartbeat_age_ms = last_startup_heartbeat_at
        .and_then(|updated_at| now.duration_since(updated_at).ok())
        .map(|age| age.as_millis().to_string())
        .unwrap_or_else(|| "none".to_string());

    append_desktop_log(&format!(
        "backend HTTP readiness check timed out after {}ms: backend_url={}, path={}, probe_timeout_ms={}, tcp_reachable={}, last_http_status={}, startup_heartbeat_age_ms={}",
        timeout.as_millis(),
        self.backend_url,
        ready_http_path,
        probe_timeout_ms,
        tcp_reachable,
        last_http_status_text,
        startup_heartbeat_age_ms,
    ));
}
```

Call site:

```rust
self.log_backend_readiness_timeout(
    limit,
    &readiness.path,
    readiness.probe_timeout_ms,
    now,
    http_status,
    ever_tcp_reachable,
    startup_heartbeat_state.last_seen_at,
);
```

This removes one struct and a layer of indirection without changing logging.

### 4. Extract child PID / status polling into a helper

The inner block in the loop can be pulled into a small helper to flatten the loop:

```rust
fn child_pid_or_error(&self) -> Result<u32, String> {
    let mut guard = self
        .child
        .lock()
        .map_err(|_| "Backend process lock poisoned.".to_string())?;
    if let Some(child) = guard.as_mut() {
        let pid = child.id();
        match child.try_wait() {
            Ok(Some(status)) => {
                *guard = None;
                Err(format!("Backend process exited before becoming reachable: {status}"))
            }
            Ok(None) => Ok(pid),
            Err(error) => Err(format!("Failed to poll backend process status: {error}")),
        }
    } else {
        Err("Backend process is not running.".to_string())
    }
}
```

Loop body then reads more linearly:

```rust
let now = SystemTime::now();
let child_pid = self.child_pid_or_error()?;

if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
    step_startup_heartbeat(
        heartbeat_path,
        child_pid,
        now,
        startup_idle_timeout,
        &mut startup_heartbeat_state,
    )?;
}
```

These changes keep your new behavior (heartbeat tracking, logging, timeout info) but reduce the number of custom types and split logic, making the readiness loop and heartbeat handling easier to follow.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues


## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="115" />
<code_context>
         (http_status, tcp_reachable)
     }

+    fn child_pid_or_error(&self) -> Result<u32, String> {
+        let mut guard = self
+            .child
</code_context>
<issue_to_address>
**issue (complexity):** Consider splitting child PID retrieval from liveness checks and factoring out repeated time-delta logic to make the readiness and heartbeat flow clearer and easier to read.

You can simplify a couple of the new pieces without changing behavior.

### 1. Separate PID lookup from liveness checking

`child_pid_or_error` currently does three things: acquires the lock, checks liveness, and returns the PID. That makes every PID lookup implicitly mutate state and pay a `try_wait` cost.

You can split this into two helpers so the responsibilities are clearer, while keeping behavior identical in the caller:

```rust
fn child_pid(&self) -> Result<u32, String> {
    let guard = self
        .child
        .lock()
        .map_err(|_| "Backend process lock poisoned.".to_string())?;

    guard
        .as_ref()
        .map(|child| child.id())
        .ok_or_else(|| "Backend process is not running.".to_string())
}

fn ensure_child_alive(&self) -> Result<(), String> {
    let mut guard = self
        .child
        .lock()
        .map_err(|_| "Backend process lock poisoned.".to_string())?;

    if let Some(child) = guard.as_mut() {
        match child.try_wait() {
            Ok(Some(status)) => {
                *guard = None;
                Err(format!(
                    "Backend process exited before becoming reachable: {status}"
                ))
            }
            Ok(None) => Ok(()),
            Err(error) => Err(format!("Failed to poll backend process status: {error}")),
        }
    } else {
        Err("Backend process is not running.".to_string())
    }
}
```

Then in the loop you can make the lifecycle handling explicit:

```rust
let now = SystemTime::now();

self.ensure_child_alive()?;
let child_pid = self.child_pid()?;

if let Some(heartbeat_path) = readiness.startup_heartbeat_path.as_deref() {
    step_startup_heartbeat(
        heartbeat_path,
        child_pid,
        now,
        startup_idle_timeout,
        &mut startup_heartbeat_state,
    )?;
}
```

This keeps the same failure modes but makes “is the process alive?” a visible step instead of hidden inside “get me the PID.”
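For comparison, `subprocess.Popen.poll` is the Python analogue of `try_wait`, so the same split can be sketched like this (illustrative only; the shipped code is Rust):

```python
import subprocess
import sys

def ensure_child_alive(child: subprocess.Popen) -> None:
    """Raise if the child has already exited; poll() is non-blocking,
    like Rust's Child::try_wait."""
    exit_code = child.poll()
    if exit_code is not None:
        raise RuntimeError(
            f"Backend process exited before becoming reachable: {exit_code}"
        )

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
ensure_child_alive(child)  # still starting: no exception, PID stays child.pid
child.terminate()
child.wait()
```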

### 2. Centralize the `duration_since(...).ok()` / age math

You currently repeat the `duration_since(...).ok()` pattern both in `startup_heartbeat_timestamp_is_fresh` and in `log_backend_readiness_timeout`. A tiny helper reduces duplication and makes the freshness logic easier to scan:

```rust
fn ms_since(earlier: SystemTime, now: SystemTime) -> Option<u128> {
    now.duration_since(earlier).ok().map(|d| d.as_millis())
}
```

Use it in the tracker:

```rust
fn startup_heartbeat_timestamp_is_fresh(
    updated_at: Option<SystemTime>,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    updated_at
        .and_then(|updated_at| ms_since(updated_at, now))
        .is_some_and(|age_ms| age_ms <= max_age.as_millis())
}
```

And in the timeout logging:

```rust
let startup_heartbeat_age_ms = last_startup_heartbeat_at
    .and_then(|updated_at| ms_since(updated_at, now))
    .map(|age| age.to_string())
    .unwrap_or_else(|| "none".to_string());
```

This also makes it easy to reuse age computation elsewhere if the heartbeat behavior evolves.
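For illustration, the same freshness rule reads naturally with plain millisecond integers in place of `SystemTime` (a hedged transliteration, not the shipped Rust):

```python
def ms_since(earlier_ms: int, now_ms: int):
    """Age in milliseconds, or None when the timestamp is in the future
    (the analogue of duration_since returning Err)."""
    return now_ms - earlier_ms if now_ms >= earlier_ms else None

def startup_heartbeat_is_fresh(updated_at_ms, now_ms: int, max_age_ms: int) -> bool:
    if updated_at_ms is None:
        return False
    age = ms_since(updated_at_ms, now_ms)
    return age is not None and age <= max_age_ms
```

A future timestamp deliberately counts as not fresh, matching the Rust `duration_since(...).ok()` behavior.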

### 3. Minor cleanup in `step_startup_heartbeat`

The `(None, None)` branch explicitly sets `state.last_seen_at = None`, which is already known to be `None` and can be dropped for less noise:

```rust
(None, None) => Ok(()),
```

This doesn’t change behavior but makes the match a bit easier to read.
</issue_to_address>

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="155" />
<code_context>
+        return False
+
+
+def heartbeat_loop(
+    path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring the heartbeat logic and heartbeat file writing into smaller, well‑named helpers to make the state and responsibilities easier to understand without changing behavior.

You can keep the current behavior but reduce the mental overhead a bit with some small refactors.

### 1. Make `heartbeat_loop` state easier to follow

The behavior is “log every failure until a first success, then log only the first failure in each new error streak.” The current flags encode this but are hard to read.

Renaming and encapsulating the condition helps without changing behavior:

```python
def heartbeat_loop(
    path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
    # At least one successful write has happened.
    had_successful_write = False
    # A warning has already been emitted since the last successful write.
    warning_emitted_since_last_success = False

    def should_warn() -> bool:
        # Before first success: always warn on failure.
        # After first success: only warn on the first failure per error streak.
        return (not had_successful_write) or (not warning_emitted_since_last_success)

    ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
    if ok:
        had_successful_write = True
    else:
        warning_emitted_since_last_success = True

    while not stop_event.wait(interval_seconds):
        warn_now = should_warn()
        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if ok:
            had_successful_write = True
            warning_emitted_since_last_success = False
        elif warn_now:
            warning_emitted_since_last_success = True
```

This keeps the same semantics but removes the ad‑hoc expression inside the loop and makes the states self-documenting.

### 2. Split `write_startup_heartbeat` into focused helpers

You can separate “what we write” from “how we write it” to shrink the responsibility of `write_startup_heartbeat`, while preserving external behavior:

```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
    return {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }


def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
    temp_path = path.with_name(f"{path.name}.tmp")
    temp_path.write_text(
        json.dumps(payload, separators=(",", ":")),
        encoding="utf-8",
    )
    temp_path.replace(path)


def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = build_heartbeat_payload(state)
        atomic_write_json(path, payload)
        return True
    except Exception as exc:
        if warn_on_error:
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path}: "
                f"{exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return False
```

Call sites stay the same, but reasoning about each piece gets easier: the payload shape, the atomic write pattern, and the error/reporting logic are now isolated.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The ASTRBOT_BACKEND_STARTUP_HEARTBEAT_PATH env var name is duplicated between Rust (BACKEND_STARTUP_HEARTBEAT_PATH_ENV) and launch_backend.py (STARTUP_HEARTBEAT_ENV); consider adding an explicit comment or a small shared-generation mechanism to avoid future drift if the name changes.
  • In resolve_backend_startup_heartbeat_path, the precedence between root_dir and ASTRBOT_ROOT is subtle; adding a brief comment about why root_dir wins over the env var and how packaged vs dev roots are chosen would make future changes safer.

## Individual Comments

### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="162" />
<code_context>
+        return False
+
+
+def heartbeat_loop(
+    path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `heartbeat_loop` warning logic to use a single flag that tracks consecutive failures instead of multiple booleans and a helper function.

You can simplify `heartbeat_loop`’s state machine without changing behavior by tracking only “have we already warned for this run of consecutive failures?” instead of two booleans plus `should_warn()`.

Current behavior:
- Always warn on the first failure.
- After a success, warn on the first subsequent failure.
- Suppress repeated warnings for continuous failures.

That can be expressed with a single flag:

```python
def heartbeat_loop(
    path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
    # Have we already emitted a warning for the current run of consecutive failures?
    warned_for_current_failure_run = False

    # Initial write
    ok = write_startup_heartbeat(path, "starting", warn_on_error=True)
    warned_for_current_failure_run = not ok

    while not stop_event.wait(interval_seconds):
        warn_now = not warned_for_current_failure_run
        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if ok:
            # Reset on success so the next failure warns again
            warned_for_current_failure_run = False
        elif warn_now:
            warned_for_current_failure_run = True
```

This keeps the “warn-on-first-failure-after-success” semantics, but removes the extra `had_successful_write` flag and nested `should_warn()` helper, making the control flow easier to follow and test.
</issue_to_address>


@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In BackendState::wait_for_ready, you call both ensure_child_alive() and child_pid(), each taking the child lock separately on every loop iteration; consider refactoring to fetch and check the child process once per iteration (e.g., return the PID from ensure_child_alive) to avoid repeated locking and reduce contention.
  • In resolve_backend_startup_heartbeat_path, the ASTRBOT_ROOT environment variable name is hard-coded; it may be safer to reuse a shared constant (similar to other env-var constants in app_constants.rs) to avoid drift if the name ever changes.
  • In start_startup_heartbeat, the atexit handler writes a final stopping heartbeat without coordinating with the heartbeat thread beyond setting stop_event; if you want a stronger guarantee on the final state, consider optionally joining the thread or adding a small wait to reduce the race with in-flight writes.

## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="162-165" />
<code_context>
         let last_http_status_text = last_http_status
             .map(|status| status.to_string())
             .unwrap_or_else(|| "none".to_string());
+        let startup_heartbeat_age_ms = last_startup_heartbeat_at
+            .and_then(|updated_at| ms_since(updated_at, now))
+            .map(|age| age.to_string())
+            .unwrap_or_else(|| "none".to_string());
         append_desktop_log(&format!(
-            "backend HTTP readiness check timed out after {}ms: backend_url={}, path={}, probe_timeout_ms={}, tcp_reachable={}, last_http_status={}",
</code_context>
<issue_to_address>
**suggestion:** Timeout logging conflates "no heartbeat" with clock issues by reporting `"none"` for all `ms_since` failures.

If `last_startup_heartbeat_at` is `Some` but `ms_since` returns `None` (e.g. timestamp is in the future), we log `startup_heartbeat_age_ms=none`, which is indistinguishable from "no heartbeat seen". It would help diagnostics to distinguish these cases (e.g. log a specific value like "future" or include the raw timestamp when `ms_since` fails).

```suggestion
        let startup_heartbeat_age_ms = match last_startup_heartbeat_at {
            Some(updated_at) => match ms_since(updated_at, now) {
                Some(age) => age.to_string(),
                None => format!("future ({:?})", updated_at),
            },
            None => "none".to_string(),
        };
```
</issue_to_address>

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="130" />
<code_context>
+    return Path(raw)
+
+
+def build_heartbeat_payload(state: str) -> dict[str, object]:
+    return {
+        "pid": os.getpid(),
</code_context>
<issue_to_address>
**issue (complexity):** Consider inlining the heartbeat JSON construction and atomic write logic directly into write_startup_heartbeat to avoid unnecessary helper indirection.

You can reduce indirection without changing behavior by inlining the tightly-coupled helpers into `write_startup_heartbeat`.

Right now:

```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
    return {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }


def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
    temp_path = path.with_name(f"{path.name}.tmp")
    temp_path.write_text(
        json.dumps(payload, separators=(",", ":")),
        encoding="utf-8",
    )
    temp_path.replace(path)


def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        atomic_write_json(path, build_heartbeat_payload(state))
        return True
    except Exception as exc:
        ...
```

These helpers are only used together in a single call path and don’t provide reuse elsewhere, so they add mental hops without buying much.

You can keep the atomic write semantics and JSON format but collapse this into a single function:

```python
def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        temp_path = path.with_name(f"{path.name}.tmp")
        payload = {
            "pid": os.getpid(),
            "state": state,
            "updated_at_ms": int(time.time() * 1000),
        }
        temp_path.write_text(
            json.dumps(payload, separators=(",", ":")),
            encoding="utf-8",
        )
        temp_path.replace(path)
        return True
    except Exception as exc:
        if warn_on_error:
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path}: "
                f"{exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return False
```

Then you can remove `build_heartbeat_payload` and `atomic_write_json` entirely. This makes the heartbeat write path self-contained and easier to scan, while preserving all existing behavior (atomic replace, payload shape, logging).
</issue_to_address>



@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The startup heartbeat freshness logic is based on SystemTime (including ms_since and describe_heartbeat_age), which makes it sensitive to wall‑clock adjustments; consider basing freshness and idle timeout decisions on a monotonic clock (Instant + durations) and only using SystemTime for logging.
  • step_startup_heartbeat treats any transition from Some(last_seen_at) to None (e.g., a single failed read/parse) as a hard error, which could make startup brittle in the face of transient filesystem/IO issues; you might want to require a grace period or consecutive failures before treating the heartbeat as permanently disappeared.
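One way to apply the monotonic-clock suggestion is to latch progress whenever the file's wall-clock timestamp *changes*, and measure idle time only on the monotonic clock; sketched here in Python with `time.monotonic` playing the role of Rust's `Instant` (an assumption-laden sketch, not the PR's code):

```python
import time

class StartupHeartbeatWatch:
    """Record heartbeat progress on the monotonic clock: the wall-clock
    updated_at_ms from the file is compared only for change, never for age."""

    def __init__(self, idle_timeout_seconds: float) -> None:
        self.idle_timeout_seconds = idle_timeout_seconds
        self._last_updated_at_ms = None
        self._last_progress = time.monotonic()

    def observe(self, updated_at_ms) -> None:
        # A new timestamp in the heartbeat file counts as progress.
        if updated_at_ms is not None and updated_at_ms != self._last_updated_at_ms:
            self._last_updated_at_ms = updated_at_ms
            self._last_progress = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self._last_progress > self.idle_timeout_seconds
```

Because staleness is derived from `time.monotonic()` deltas, a wall-clock step (NTP sync, manual adjustment) cannot spuriously fail or revive startup.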

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 3 issues and left some high-level feedback:

  • The Rust tests startup_heartbeat_progress_is_fresh_for_recent_instant and startup_heartbeat_progress_is_not_fresh_when_stale create a TempDir and write a heartbeat file that is never read, which adds noise without exercising the heartbeat file parsing; consider removing the unused file I/O or adjusting the tests to cover read_startup_heartbeat_updated_at explicitly.

## Individual Comments

### Comment 1
<location path="scripts/backend/templates/launch_backend.py" line_range="139-145" />
<code_context>
+    }
+
+
+def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
+    temp_path = path.with_name(f"{path.name}.tmp")
+    temp_path.write_text(
+        json.dumps(payload, separators=(",", ":")),
+        encoding="utf-8",
+    )
+    temp_path.replace(path)
+
+
</code_context>
<issue_to_address>
**suggestion:** Handle failures of `replace` in `atomic_write_json` to avoid leaving stale temp files silently.

If `temp_path.replace(path)` fails (e.g., permissions, transient FS issues, AV interference), the temp file is left behind and the error isn’t distinguished from the write step. Consider wrapping `replace` in a `try` and surfacing its failure in the heartbeat warning, and/or explicitly `unlink`ing the temp file on failure to avoid accumulating stale `.tmp` files.

Suggested implementation:

```python
def build_heartbeat_payload(state: str) -> dict[str, object]:
    return {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }


def atomic_write_json(path: Path, payload: dict[str, object]) -> None:
    temp_path = path.with_name(f"{path.name}.tmp")
    temp_path.write_text(
        json.dumps(payload, separators=(",", ":")),
        encoding="utf-8",
    )
    try:
        temp_path.replace(path)
    except Exception:
        # Best-effort cleanup to avoid accumulating stale temp files.
        try:
            temp_path.unlink(missing_ok=True)
        except Exception:
            # Ignore cleanup failures; original error is more important.
            pass
        raise


import atexit
import ctypes
import json
import os
import runpy
import sys
import threading
import time
from pathlib import Path

```

1. Wherever `atomic_write_json` is called for the heartbeat file, consider catching exceptions and including explicit context in any warnings/logs (e.g., distinguish between write vs. replace failures if you propagate more specific exception types or messages later).
2. If the project uses a logging framework in this file, you may want to log the replace failure inside the `except` block before re-raising, following existing logging conventions.
</issue_to_address>

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="199-213" />
<code_context>
+        daemon=True,
+    )
+
+    def on_exit() -> None:
+        stop_event.set()
+        thread.join(timeout=STARTUP_HEARTBEAT_STOP_JOIN_TIMEOUT_SECONDS)
+        write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)
+
+    atexit.register(on_exit)
</code_context>
<issue_to_address>
**suggestion:** Guard `on_exit` against rare cases where the heartbeat thread never started.

If `thread.start()` fails (for example due to resource exhaustion) after `atexit.register(on_exit)` is called, `on_exit` will call `thread.join(...)` on a thread that never ran. CPython tolerates this today, but it’s brittle and could obscure startup failures. Consider either registering the `atexit` handler only after a successful `thread.start()`, or tracking a flag so `on_exit` skips `join` when the thread wasn’t started.

```suggestion
    stop_event = threading.Event()
    thread = threading.Thread(
        target=heartbeat_loop,
        args=(heartbeat_path, STARTUP_HEARTBEAT_INTERVAL_SECONDS, stop_event),
        name="astrbot-startup-heartbeat",
        daemon=True,
    )
    thread_started = False

    def on_exit() -> None:
        stop_event.set()
        if thread_started:
            thread.join(timeout=STARTUP_HEARTBEAT_STOP_JOIN_TIMEOUT_SECONDS)
        write_startup_heartbeat(heartbeat_path, "stopping", warn_on_error=True)

    atexit.register(on_exit)
    thread.start()
    thread_started = True
```
</issue_to_address>

### Comment 3
<location path="scripts/backend/templates/launch_backend.py" line_range="164" />
<code_context>
+        return False
+
+
+def heartbeat_loop(
+    path: Path, interval_seconds: float, stop_event: threading.Event
+) -> None:
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the heartbeat warning logic and inlining the payload builder to make the heartbeat flow easier to follow and maintain.

You can reduce complexity in two small, targeted ways without changing the overall behavior (periodic atomic JSON writes + “stopping” heartbeat).

### 1. Simplify heartbeat loop warning logic

The current `heartbeat_loop` has two flags plus a nested `should_warn` helper. You can replace that small state machine with a simple time-based throttling scheme, which is easier to read and reason about.

For example, throttle warnings to at most once every 10 seconds:

```python
WARNING_THROTTLE_SECONDS = 10.0


def heartbeat_loop(
    path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
    last_warning_time: float | None = None

    while not stop_event.wait(interval_seconds):
        now = time.time()
        warn_now = (
            last_warning_time is None
            or (now - last_warning_time) >= WARNING_THROTTLE_SECONDS
        )

        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)

        if warn_now and not ok:
            last_warning_time = now
```

This keeps log volume bounded while removing `had_successful_write`, `warning_emitted_since_last_success`, and the nested function.

### 2. Inline `build_heartbeat_payload` into `write_startup_heartbeat`

`build_heartbeat_payload` only wraps a small dict and is only used from `write_startup_heartbeat`. Inlining it removes one layer in the call stack (`build_heartbeat_payload → atomic_write_json → write_startup_heartbeat → heartbeat_loop`), making the flow easier to follow.

```python
def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    payload = {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }

    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        atomic_write_json(path, payload)
        return True
    except Exception as exc:
        if warn_on_error:
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path}: "
                f"{exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return False
```

`atomic_write_json` remains a focused helper for the critical “write temp → replace” behavior, but the heartbeat logic is now more localized and straightforward.
</issue_to_address>



@zouyonghe
Member Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues


## Individual Comments

### Comment 1
<location path="src-tauri/src/backend/readiness.rs" line_range="185" />
<code_context>
+struct StartupHeartbeatTracker {
+    last_seen_at: Option<SystemTime>,
+    last_progress_at: Option<Instant>,
+    consecutive_invalid_reads: u8,
+    logged_fresh: bool,
+}
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `StartupHeartbeatTracker` state machine by replacing the invalid-read counter with a boolean flag and inlining the freshness helper logic to make the control flow easier to follow.

You can keep all behavior but trim the state machine in `StartupHeartbeatTracker` by replacing the `u8` counter and helper around `Option<Instant>` with simpler fields.

### 1. Replace `consecutive_invalid_reads: u8` with a boolean

You only ever compare the counter to a small constant and reset it to `0`, so a boolean flag expresses the intent (“we already saw one invalid/missing read”) more directly:

```rust
#[derive(Debug, Clone, Copy)]
struct StartupHeartbeatTracker {
    last_seen_at: Option<SystemTime>,
    last_progress_at: Option<Instant>,
    had_invalid_read: bool,
    logged_fresh: bool,
}
```

Then in `step_startup_heartbeat`:

```rust
match (previous, current) {
    (Some(previous), None) => {
        if !state.had_invalid_read {
            state.had_invalid_read = true;
            return Ok(());
        }

        let heartbeat_age_ms = describe_heartbeat_age(Some(previous), wall_now);
        append_desktop_log(&format!(
            "backend startup heartbeat disappeared or became invalid before HTTP dashboard became ready: last_valid_age_ms={heartbeat_age_ms}"
        ));
        return Err(
            "Backend startup heartbeat disappeared or became invalid before HTTP readiness."
                .to_string(),
        );
    }
    (None, None) => {
        state.had_invalid_read = false;
        Ok(())
    }
    (_, Some(current)) => {
        state.had_invalid_read = false;
        // existing logic...
    }
}
```

This preserves the “one grace failure” behavior of `STARTUP_HEARTBEAT_INVALID_READ_THRESHOLD == 2` while making the disappearance handling easier to follow.

### 2. Inline `startup_heartbeat_progress_is_fresh` and drop the `Option` parameter

`startup_heartbeat_progress_is_fresh` is currently only used with `state.last_progress_at`, and immediately short-circuits on `None`. You can move that logic into `step_startup_heartbeat` and keep `last_progress_at` as an `Option<Instant>` without a separate helper:

```rust
// inside the (_, Some(current)) arm after updating state.last_progress_at:

let is_fresh = match state.last_progress_at {
    Some(last) => monotonic_now.duration_since(last) <= idle_timeout,
    None => true, // treat as fresh until first heartbeat progress is recorded
};

if is_fresh {
    if !state.logged_fresh {
        append_desktop_log(
            "backend startup heartbeat is fresh while HTTP dashboard is not ready yet; waiting",
        );
        state.logged_fresh = true;
    }
    Ok(())
} else {
    append_desktop_log(
        "backend startup heartbeat went stale before HTTP dashboard became ready",
    );
    Err(format!(
        "Backend startup heartbeat went stale after {}ms without HTTP readiness.",
        idle_timeout.as_millis()
    ))
}
```

With this change you can delete:

```rust
fn startup_heartbeat_progress_is_fresh(
    last_progress_at: Option<Instant>,
    now: Instant,
    max_age: Duration,
) -> bool { ... }
```

These two adjustments keep all existing behavior (including the grace-miss semantics and stale detection) but make the state and control flow less intricate.
</issue_to_address>
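The grace-miss semantics this comment preserves are language-agnostic, so they can be modeled compactly in Python; `HeartbeatTracker` and its return values here are illustrative names, not the actual Rust API:

```python
class HeartbeatTracker:
    """Tolerate exactly one missing or invalid read after a valid
    heartbeat before declaring the heartbeat gone."""

    def __init__(self) -> None:
        self.last_seen = None
        self.had_invalid_read = False

    def step(self, current) -> str:
        # `current` is the parsed heartbeat payload, or None when the
        # file is missing or unparseable.
        if current is not None:
            self.last_seen = current
            self.had_invalid_read = False
            return "ok"
        if self.last_seen is None:
            # No valid heartbeat yet; nothing has "disappeared".
            return "ok"
        if not self.had_invalid_read:
            # First miss after a valid read: one grace failure.
            self.had_invalid_read = True
            return "ok"
        return "error"
```

A single transient read failure passes through, a subsequent valid read clears the flag, and only two consecutive misses after a valid heartbeat escalate to an error — the same behavior the boolean-flag refactor above describes.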

### Comment 2
<location path="scripts/backend/templates/launch_backend.py" line_range="131" />
<code_context>
-main_file = APP_DIR / "main.py"
-if not main_file.is_file():
-    raise FileNotFoundError(f"Backend entrypoint not found: {main_file}")
+def build_heartbeat_payload(state: str) -> dict[str, object]:
+    return {
+        "pid": os.getpid(),
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the heartbeat implementation by inlining helpers, reducing state in the loop, and centralizing logging policy to make the code easier to follow.

You can simplify the heartbeat implementation without changing behavior in any meaningful way for this feature.

### 1. Flatten `build_heartbeat_payload` and `atomic_write_json` into `write_startup_heartbeat`

For this internal heartbeat file, the extra helper layers don’t add much clarity, but they do add mental overhead. You can keep the atomic-write behavior while reducing the helper depth:

```py
def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    payload = {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }
    temp_path = path.with_name(f"{path.name}.tmp")

    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        temp_path.write_text(
            json.dumps(payload, separators=(",", ":")),
            encoding="utf-8",
        )
        temp_path.replace(path)
        return True
    except Exception as exc:
        # Best-effort cleanup; ignore errors.
        try:
            temp_path.unlink(missing_ok=True)
        except Exception:
            pass

        if warn_on_error:
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path}: "
                f"{exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return False
```

This removes two helpers (`build_heartbeat_payload`, `atomic_write_json`) while preserving the atomic update behavior and logging.

### 2. Simplify `heartbeat_loop` logging policy

The dual-boolean state (`had_successful_write`, `warning_emitted_since_last_success`) plus `should_warn` is a lot of logic for a logging policy. A time-based throttle keeps visibility and is easier to follow.

For example, log at most once every N seconds:

```py
LOG_THROTTLE_SECONDS = 30.0  # or whatever is reasonable

def heartbeat_loop(
    path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
    last_log_time = 0.0

    while not stop_event.wait(interval_seconds):
        now = time.time()
        warn_now = (now - last_log_time) >= LOG_THROTTLE_SECONDS

        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if not ok and warn_now:
            last_log_time = now
```

If you still want “every failure before first success, then throttled”, you can keep a single `had_successful_write` flag instead of two booleans plus a closure:

```py
def heartbeat_loop(
    path: Path, interval_seconds: float, stop_event: threading.Event
) -> None:
    had_successful_write = False
    last_log_time = 0.0
    LOG_THROTTLE_SECONDS = 30.0

    while not stop_event.wait(interval_seconds):
        now = time.time()
        warn_now = (not had_successful_write) or (
            (now - last_log_time) >= LOG_THROTTLE_SECONDS
        )

        ok = write_startup_heartbeat(path, "starting", warn_on_error=warn_now)
        if ok:
            had_successful_write = True
        elif warn_now:
            last_log_time = now
```

This preserves the “pre-first-success: log every failure; post-first-success: throttled” semantics with much simpler state.

### 3. Consider removing `warn_on_error` + `bool` return if logging is unified

If you adopt a simple throttle inside the loop, you can also simplify `write_startup_heartbeat` to only raise or only log, and keep the logging policy in one place. For example:

```py
def write_startup_heartbeat(path: Path, state: str) -> None:
    # same body as above but raise instead of returning bool / warn_on_error
    ...
```

Then in `heartbeat_loop`:

```py
try:
    write_startup_heartbeat(path, "starting")
except Exception as exc:
    if warn_now:
        print(
            f"[startup-heartbeat] failed to write heartbeat to {path}: "
            f"{exc.__class__.__name__}: {exc}",
            file=sys.stderr,
        )
```

This removes the `warn_on_error` flag and boolean return, and concentrates the logging policy in the loop.
</issue_to_address>
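The flattened `write_startup_heartbeat` suggested in this comment can be sanity-checked end to end against a temporary directory; the function body below reuses the suggested code verbatim, and only the assertions at the bottom are added here:

```python
import json
import os
import sys
import tempfile
import time
from pathlib import Path


def write_startup_heartbeat(
    path: Path, state: str, *, warn_on_error: bool = False
) -> bool:
    payload = {
        "pid": os.getpid(),
        "state": state,
        "updated_at_ms": int(time.time() * 1000),
    }
    temp_path = path.with_name(f"{path.name}.tmp")

    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        temp_path.write_text(
            json.dumps(payload, separators=(",", ":")),
            encoding="utf-8",
        )
        temp_path.replace(path)
        return True
    except Exception as exc:
        # Best-effort cleanup; ignore errors.
        try:
            temp_path.unlink(missing_ok=True)
        except Exception:
            pass
        if warn_on_error:
            print(
                f"[startup-heartbeat] failed to write heartbeat to {path}: "
                f"{exc.__class__.__name__}: {exc}",
                file=sys.stderr,
            )
        return False


with tempfile.TemporaryDirectory() as tmp:
    hb = Path(tmp) / "heartbeat.json"
    assert write_startup_heartbeat(hb, "starting") is True
    data = json.loads(hb.read_text(encoding="utf-8"))
    assert data["state"] == "starting" and data["pid"] == os.getpid()
    # The temp file must not linger after a successful atomic replace.
    assert not hb.with_name("heartbeat.json.tmp").exists()
```

This confirms the payload shape, the atomic write-then-replace behavior, and that no stale `.tmp` file is left behind on the success path.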




@zouyonghe zouyonghe merged commit e90f366 into AstrBotDevs:main Apr 10, 2026
4 checks passed