diff --git a/dev-doc/host-config-timeout-progress-logging.md b/dev-doc/host-config-timeout-progress-logging.md new file mode 100644 index 00000000..01502029 --- /dev/null +++ b/dev-doc/host-config-timeout-progress-logging.md @@ -0,0 +1,414 @@ +# Host Configuration Timeout Progress Logging + +**Issue:** [#748 — Improve reporting of host configuration script timeouts](https://github.com/aws-deadline/deadline-cloud-worker-agent/issues/748) + +**Date:** 2026-04-02 + +## Problem + +The host configuration timeout is enforced server-side — when the timeout is +reached, the service backend kills the worker. The agent itself has no +client-side timeout enforcement. From the agent's perspective, it is simply +running a script and waiting for it to finish. When the service kills the +worker, the final logs show a generic `exit code: 1` and a `Windows Service is +being stopped` message with no indication that the failure was caused by a +timeout. Operators must manually correlate timestamps to figure out what +happened. + +There is zero in-flight visibility into how long the script has been running or +how much time remains before the service kills the worker. This makes debugging +host configuration failures unnecessarily difficult — especially since the +worker is killed externally and never gets a chance to log "I was timed out." + +## Summary + +We add a daemon timer thread in `HostConfigurationScriptRunner.run()` that +periodically logs elapsed and remaining time while the host configuration +script executes. We do not extend `ScriptRunnerBase` or use its built-in +`_runtime_limit` timer because that mechanism is designed for client-side +process termination — host configuration timeout is enforced server-side, and +our timer is purely observational. 
The timer emits standard +`WorkerHostConfigurationLogEvent` structured logs consistent with the existing +conventions in [logging.md](../docs/logging.md), using the `Worker / +HostConfiguration` type with no new fields or enum values. + +## Proposal + +Add a background daemon thread inside `HostConfigurationScriptRunner.run()` that +periodically logs elapsed time and remaining time while the host configuration +script executes. The thread prints at 30-second intervals, then accelerates to +10-second intervals when ≤60 seconds remain. + +``` +Host Config Time — Elapsed: 120s, Remaining: 180s +Host Config Time — Elapsed: 150s, Remaining: 150s +... +Host Config Time — Elapsed: 250s, Remaining: 50s ← switches to 10s interval +Host Config Time — Elapsed: 260s, Remaining: 40s +``` + +## Current Architecture + +### Execution flow + +``` +HostConfigurationScriptRunner.run() +├── _write_script_file() # materializes script to disk +├── _run_posix() # posix path +│ ├── self._run(command) # calls ScriptRunnerBase._run() — NO time_limit passed +│ └── _action_event.wait() # blocks until callback fires (no timeout arg) +└── _run_win32(script_file_path) # win32 path + └── _WindowsScriptRunner.run_powershell() + └── ShellExecuteEx + WaitForSingleObject loop # blocks until process exits +``` + +### How timeout is enforced today + +The timeout is enforced **server-side only**. The service backend tracks how +long the worker has been in host configuration and kills the worker when the +timeout (`scriptTimeoutSeconds`) is reached. The agent has no client-side +timeout enforcement — it simply runs the script and waits indefinitely for it +to complete. + +The timeout value (`_host_configuration_timeout_seconds`) is passed to the +child process as the `HOST_CONFIG_TIMEOUT_SECONDS` environment variable (so the +customer's script can optionally self-enforce), but the agent itself does +nothing with it beyond that. + +This means when a timeout occurs: +1. 
The service kills the worker externally +2. The agent process dies mid-execution +3. The last logs show a generic exit code with no timeout context +4. The operator has no way to tell from logs alone that a timeout happened + +### Base class capabilities (ScriptRunnerBase) + +`ScriptRunnerBase._run()` from +[openjd-sessions](https://github.com/OpenJobDescription/openjd-sessions-for-python/blob/519c971/src/openjd/sessions/_runner_base.py) +accepts an optional `time_limit: Optional[timedelta]` parameter. When provided, +it creates a `threading.Timer` that fires `_on_timelimit()`, which: + +1. Sets `self._runtime_limit_reached = True` +2. Calls `self.cancel()` → `_cancel(TerminateCancelMethod())` → `self._process.terminate()` + +The base class also has: +- A `Lock` (`self._lock`) protecting subprocess state mutations +- A `ThreadPoolExecutor(max_workers=1)` running the subprocess future +- A `_on_process_exit` callback that cleans up timers + +**Key observation:** `_run_posix()` calls `self._run(command)` without passing +`time_limit`. This means the base class's built-in timeout/cancel machinery is +completely unused for host configuration. The Windows path bypasses +`ScriptRunnerBase._run()` entirely, using its own `_WindowsScriptRunner`. + +This is intentional — the agent does not enforce the timeout client-side. The +service backend is the authority that kills the worker on timeout. + +### Implication for this design + +We should **not** try to use the base class `time_limit` mechanism for the +progress timer because: + +1. The base class `Timer` is a one-shot kill timer, not a periodic progress + logger. Repurposing it would fight the abstraction. +2. On Windows, `ScriptRunnerBase._run()` is not used at all — the Windows path + goes through `_WindowsScriptRunner` which has its own process management. +3. The progress timer is purely observational (logging only). It should not + interact with process lifecycle or cancellation state. 
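The distinction in point 1 can be illustrated with the standard library alone. This is a minimal standalone sketch (the names `tick_loop`, `stop`, and `ticks` are illustrative, not from the agent codebase): a `threading.Timer` fires its callback exactly once, while an `Event.wait()` loop ticks repeatedly and wakes immediately when stopped.

```python
import threading
import time

# One-shot: threading.Timer runs its callback exactly once, then its thread exits.
fired: list[float] = []
one_shot = threading.Timer(0.05, lambda: fired.append(time.monotonic()))
one_shot.start()
one_shot.join()
assert len(fired) == 1  # fires once; it cannot serve as a periodic progress logger

# Periodic: an Event.wait() loop ticks until the event is set, and a pending
# wait() returns immediately when stop() fires instead of finishing a sleep.
stop = threading.Event()
ticks: list[float] = []

def tick_loop() -> None:
    # wait() returns False on timeout (a tick) and True once the event is set (stop)
    while not stop.wait(timeout=0.02):
        ticks.append(time.monotonic())

worker = threading.Thread(target=tick_loop, daemon=True)
worker.start()
time.sleep(0.1)
stop.set()
worker.join()
assert len(ticks) >= 2  # multiple ticks occurred before stop
```

Setting the event both terminates the loop and interrupts an in-progress `wait()`, which is exactly the prompt-shutdown property the progress timer needs.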
+ +A standalone timer thread in `run()` is the cleanest approach — it wraps both +platform paths uniformly without touching the base class contract. + +## Thread Safety Analysis + +### Python's logging module + +Python's `logging.Handler.emit()` acquires `self.lock` (a `threading.RLock`) +before writing. The base `Handler` class guarantees this in `Handler.handle()`: + +```python +# From cpython/Lib/logging/__init__.py +def handle(self, record): + ... + self.acquire() # acquires self.lock (RLock) + try: + self.emit(record) + finally: + self.release() +``` + +So calling `logger.info(...)` from any thread is safe at the `logging` module +level. + +### CloudWatchHandler + +`CloudWatchHandler.emit()` appends to a `collections.deque` +(`self._log_event_queue`). Python's `deque.append()` is atomic under the GIL +(it's a single C-level operation). A separate `CloudWatchLogStreamThread` +consumes from this deque. No explicit lock is needed — the GIL + atomic +`deque.append` is sufficient. + +### StreamHandler (console) and RotatingFileHandler + +Both inherit from `Handler` and acquire the handler-level `RLock` in +`handle()` before calling `emit()`. Thread-safe by design. + +### Conclusion + +**Calling `self._log.info(...)` from the timer thread is safe.** All handlers +in the chain (console `StreamHandler`, `RotatingFileHandler`, +`CloudWatchHandler`) are thread-safe. No additional locking is needed in the +timer. + +### What about the LoggerAdapter? + +`HostConfigurationScriptRunner` has two logger references: +- `self._log` — the raw `Logger` passed from the entrypoint +- `self._logger_adapter` — a `LoggerAdapter` wrapping `self._log` + +The `LoggerAdapter` is a thin wrapper that calls through to the underlying +logger. It adds no state that would be unsafe across threads. However, the +timer thread should use `self._log` (the raw logger) directly, not +`self._logger_adapter`, because: + +1. 
The `LoggerAdapter` adds `worker_id` via `extra`, but our + `WorkerHostConfigurationLogEvent` already includes `worker_id` in its own + structured output. +2. Using the raw logger avoids any coupling to the adapter's `extra` dict + processing from a concurrent thread. + +## Design + +### Where to add the thread + +The timer thread starts and stops inside `HostConfigurationScriptRunner.run()`, +wrapping the platform-specific execution call. This keeps the change contained +to a single method and works for both posix and Windows. + +```python +# In host_configuration_script.py + +def run(self) -> int: + ... + timer = _HostConfigTimer( + timeout_seconds=self._host_configuration_timeout_seconds, + logger=self._log, + farm_id=self._configuration.farm_id, + fleet_id=self._configuration.fleet_id, + worker_id=self._worker_id, + ) + timer.start() + try: + with FileContext(script_file_path) as _: + if sys.platform == "win32": + exit_code = self._run_win32(script_file_path) + else: + exit_code = self._run_posix() + finally: + timer.stop() + ... +``` + +### Timer class + +A small, self-contained helper class. Not a subclass of `threading.Thread` — +it owns a thread internally and exposes `start()` / `stop()`. 
```python
import time
from logging import Logger
from threading import Event, Thread
from typing import Optional

class _HostConfigTimer:
    """Logs elapsed/remaining time while the host configuration script runs."""

    _NORMAL_INTERVAL_S: int = 30
    _ACCELERATED_INTERVAL_S: int = 10
    _ACCELERATE_THRESHOLD_S: int = 60

    def __init__(
        self,
        *,
        timeout_seconds: int,
        logger: Logger,
        farm_id: str,
        fleet_id: str,
        worker_id: str,
    ) -> None:
        self._timeout_seconds = timeout_seconds
        self._logger = logger
        self._farm_id = farm_id
        self._fleet_id = fleet_id
        self._worker_id = worker_id
        self._stop_event = Event()
        self._thread: Optional[Thread] = None

    def start(self) -> None:
        self._thread = Thread(target=self._run, daemon=True, name="host-config-timer")
        self._thread.start()

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=5)

    def _run(self) -> None:
        start_time = time.monotonic()
        while not self._stop_event.is_set():
            elapsed = time.monotonic() - start_time
            remaining = max(0, self._timeout_seconds - elapsed)

            interval = (
                self._ACCELERATED_INTERVAL_S
                if remaining <= self._ACCELERATE_THRESHOLD_S
                else self._NORMAL_INTERVAL_S
            )

            self._stop_event.wait(timeout=interval)
            if self._stop_event.is_set():
                break

            elapsed = time.monotonic() - start_time
            remaining = max(0, self._timeout_seconds - elapsed)

            self._logger.info(
                WorkerHostConfigurationLogEvent(
                    farm_id=self._farm_id,
                    fleet_id=self._fleet_id,
                    worker_id=self._worker_id,
                    message=(
                        f"Host Config Time — Elapsed: {int(elapsed)}s, "
                        f"Remaining: {int(remaining)}s"
                    ),
                    status=WorkerHostConfigurationStatus.RUNNING,
                )
            )

            if remaining <= 0:
                self._logger.warning(
                    WorkerHostConfigurationLogEvent(
                        farm_id=self._farm_id,
                        fleet_id=self._fleet_id,
                        worker_id=self._worker_id,
                        message=(
                            f"Host config timeout expired "
                            f"({self._timeout_seconds}s). "
                            f"Worker may be terminated by the service."
+ ), + status=WorkerHostConfigurationStatus.FAILED, + ) + ) + break +``` + +### Why not use ScriptRunnerBase._runtime_limit? + +| Concern | Base class Timer | Our _HostConfigTimer | +|---|---|---| +| Purpose | One-shot kill after timeout | Periodic progress logging | +| Fires | Once, then cancels the process | Every 30s/10s | +| Platform | Posix only (via `_run()`) | Both posix and Windows | +| Side effects | Terminates subprocess | None — logging only | + +The base class timer is a process-lifecycle mechanism designed for client-side +timeout enforcement. Our timer is purely observational — it provides visibility +into the countdown before the **service backend** kills the worker. It must not +terminate the subprocess or interact with process lifecycle, since timeout +enforcement is the service's responsibility. + +### Interval logic + +| Remaining time | Log interval | +|---|---| +| > 60s | Every 30s | +| ≤ 60s | Every 10s | +| 0s | One final WARNING log, then thread exits | + +The thread uses `Event.wait(timeout=interval)` instead of `time.sleep()` so it +can be stopped immediately when the script finishes — no dangling sleeps. + +### Log format + +Uses the existing `WorkerHostConfigurationLogEvent` structured log so the +messages are consistent with all other host configuration logs and are +parseable by CloudWatch / log aggregation. Per the project's +[logging conventions](../docs/logging.md), these are typed logs +(`type: Worker`, `subtype: HostConfiguration`) which means they are part of the +stable structured log contract and can be relied upon for programmatic queries +— unlike untyped messages which "may change at any time." + +The timer reuses `WorkerHostConfigurationStatus.RUNNING` for progress messages +(same status used at script start) and `WorkerHostConfigurationStatus.FAILED` +for the timeout-reached warning. No new enum values or log fields are needed. 
Logs are emitted via the standard `Logger`, which routes to all configured
handlers: the local rotating files (`worker-agent.log`, `worker-agent-bootstrap.log`)
and the worker's CloudWatch log stream. Progress messages will appear in both
destinations.

Example output in the structured JSON log stream:

```json
{"level": "INFO", "ti": "📜", "type": "Worker", "subtype": "HostConfiguration",
 "message": "Host Config Time — Elapsed: 120s, Remaining: 180s",
 "status": "Running", ...}
```

When the timeout is reached:

```json
{"level": "WARNING", "ti": "📜", "type": "Worker", "subtype": "HostConfiguration",
 "message": "Host config timeout expired (300s). Worker may be terminated by the service.",
 "status": "Failed", ...}
```

## Files to Change

| File | Change |
|---|---|
| `src/deadline_worker_agent/startup/host_configuration_script.py` | Add `_HostConfigTimer` class; wrap the execution block in `run()` with `timer.start()` / `timer.stop()` |
| `test/unit/startup/test_host_config_timer.py` (new) | Unit tests for `_HostConfigTimer` |
| `test/integ/startup/test_host_configuration.py` | Verify timer logs appear during integration test runs |

No changes to `log_messages.py`, `entrypoint.py`, or the Windows runner — the
timer is entirely self-contained in the runner's `run()` method.

## Edge Cases

1. **Script finishes before first interval** — `stop()` sets the event before
   the first `wait()` completes; no progress log is emitted. This is fine for
   fast scripts.

2. **Timeout of 0 or very small values** — With remaining time already at or
   below the acceleration threshold, the first wait uses the 10-second
   interval, so the thread logs the timeout-reached warning and exits within
   roughly 10 seconds. No infinite loop risk.

3. **Thread safety** — See analysis above. All handlers in the logging chain
   are thread-safe. The timer thread only reads immutable init-time state and
   calls `logger.info()`.

4. **Daemon thread and process termination** — Marked `daemon=True` so it won't
   prevent process exit.
   In the normal path (script completes), the `finally`
   block calls `stop()` for clean shutdown. When the service backend kills the
   worker via SIGTERM/SIGKILL, the entire process dies and the daemon thread
   dies with it — no cleanup runs, which is fine. The timer holds no resources
   that need explicit release (no file handles, no network connections). This
   is the expected termination path for a timeout scenario.

5. **Interaction with base class Lock** — The timer thread does not touch
   `self._lock`, `self._process`, `self._run_future`, or any other
   `ScriptRunnerBase` state. It only calls `self._log.info()`. No deadlock
   risk.

6. **LoggerAdapter vs Logger** — Timer uses `self._log` (raw `Logger`), not
   `self._logger_adapter`. The `LoggerAdapter` adds `extra` dict processing
   that is unnecessary since `WorkerHostConfigurationLogEvent` already carries
   all structured fields. This also avoids any theoretical concern about
   concurrent `extra` dict access.

## Testing Strategy

- **Unit tests**: Patch the timer's interval constants to sub-second values so
  the tests complete quickly without mocking the clock. Verify:
  - Progress logs are emitted at the normal interval
  - Switch to the accelerated interval at ≤60s remaining
  - Timeout-reached warning is logged
  - `stop()` terminates the thread promptly
  - No log emitted when script finishes before first interval
- **Integration tests**: Run a short script (e.g., `sleep 5` with timeout=300)
  and verify no timer logs appear (script finishes before first 30s interval).
  Run a longer script and verify timer logs are present.

diff --git a/src/deadline_worker_agent/startup/host_configuration_script.py b/src/deadline_worker_agent/startup/host_configuration_script.py
index 2702f9a9..e4aa38c3 100644
--- a/src/deadline_worker_agent/startup/host_configuration_script.py
+++ b/src/deadline_worker_agent/startup/host_configuration_script.py
@@ -1,9 +1,10 @@
 # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+import time from datetime import timedelta from logging import Logger from pathlib import Path -from threading import Event +from threading import Event, Thread from openjd.model import SymbolTable from typing import Optional import sys @@ -31,6 +32,94 @@ from ..windows.win_admin_runner import _WindowsScriptRunner +class _HostConfigTimer: + """Periodically logs elapsed and remaining time while the host configuration script runs. + + The host configuration timeout is enforced server-side — the service backend kills the worker + when the timeout is reached. This timer provides client-side visibility into the countdown + so operators can see progress in the logs before the worker is terminated. + + Logs every 30s normally, accelerating to every 10s when ≤60s remain. + """ + + _NORMAL_INTERVAL_S: int = 30 + _ACCELERATED_INTERVAL_S: int = 10 + _ACCELERATE_THRESHOLD_S: int = 60 + + def __init__( + self, + *, + timeout_seconds: int, + logger: Logger, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + self._timeout_seconds = timeout_seconds + self._logger = logger + self._farm_id = farm_id + self._fleet_id = fleet_id + self._worker_id = worker_id + self._stop_event = Event() + self._thread: Optional[Thread] = None + + def start(self) -> None: + self._thread = Thread(target=self._run, daemon=True, name="host-config-timer") + self._thread.start() + + def stop(self) -> None: + self._stop_event.set() + if self._thread is not None: + self._thread.join(timeout=5) + + def _run(self) -> None: + start_time = time.monotonic() + + while not self._stop_event.is_set(): + elapsed = time.monotonic() - start_time + remaining = max(0, self._timeout_seconds - elapsed) + + interval = ( + self._ACCELERATED_INTERVAL_S + if remaining <= self._ACCELERATE_THRESHOLD_S + else self._NORMAL_INTERVAL_S + ) + + self._stop_event.wait(timeout=interval) + if self._stop_event.is_set(): + break + + elapsed = time.monotonic() - start_time + remaining = max(0, self._timeout_seconds - elapsed) 
+ + self._logger.info( + WorkerHostConfigurationLogEvent( + farm_id=self._farm_id, + fleet_id=self._fleet_id, + worker_id=self._worker_id, + message=( + f"Host Config Time — Elapsed: {int(elapsed)}s, Remaining: {int(remaining)}s" + ), + status=WorkerHostConfigurationStatus.RUNNING, + ) + ) + + if remaining <= 0: + self._logger.warning( + WorkerHostConfigurationLogEvent( + farm_id=self._farm_id, + fleet_id=self._fleet_id, + worker_id=self._worker_id, + message=( + f"Host config timeout expired ({self._timeout_seconds}s). " + f"Worker may be terminated by the service." + ), + status=WorkerHostConfigurationStatus.FAILED, + ) + ) + break + + class HostConfigurationScriptRunner(ScriptRunnerBase): """Host Configuration Script Runner. Borrows from OpenJD Script Runner Base, similar to Session Actions""" @@ -139,11 +228,22 @@ def run(self) -> int: logger=self._logger_adapter, section_title="Running Host Configuration Script" ) - with FileContext(script_file_path) as _: - if sys.platform == "win32": - exit_code = self._run_win32(script_file_path) - else: - exit_code = self._run_posix() + timer = _HostConfigTimer( + timeout_seconds=self._host_configuration_timeout_seconds, + logger=self._log, + farm_id=self._configuration.farm_id, + fleet_id=self._configuration.fleet_id, + worker_id=self._worker_id, + ) + timer.start() + try: + with FileContext(script_file_path) as _: + if sys.platform == "win32": + exit_code = self._run_win32(script_file_path) + else: + exit_code = self._run_posix() + finally: + timer.stop() self._log_section_banner( logger=self._logger_adapter, diff --git a/test/unit/startup/test_host_config_timer.py b/test/unit/startup/test_host_config_timer.py new file mode 100644 index 00000000..5a8ae304 --- /dev/null +++ b/test/unit/startup/test_host_config_timer.py @@ -0,0 +1,345 @@ +# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+ +"""Tests for the _HostConfigTimer class""" + +from __future__ import annotations + +import logging +import time +from unittest.mock import MagicMock, patch + +import pytest + +from deadline_worker_agent.startup.host_configuration_script import _HostConfigTimer +from deadline_worker_agent.log_messages import ( + WorkerHostConfigurationLogEvent, + WorkerHostConfigurationStatus, +) + + +@pytest.fixture +def logger() -> MagicMock: + return MagicMock(spec=logging.Logger) + + +@pytest.fixture +def farm_id() -> str: + return "farm-00000000000000000000000000000000" + + +@pytest.fixture +def fleet_id() -> str: + return "fleet-00000000000000000000000000000000" + + +@pytest.fixture +def worker_id() -> str: + return "worker-00000000000000000000000000000000" + + +class TestHostConfigTimer: + """Tests for _HostConfigTimer""" + + def test_stop_before_first_interval_emits_no_logs( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """When the script finishes before the first 30s interval, no progress log is emitted.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + timer.start() + # Stop immediately + timer.stop() + + logger.info.assert_not_called() + logger.warning.assert_not_called() + + def test_emits_progress_log_after_interval( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Timer emits a progress log after the first interval elapses.""" + # Use a short timeout so the test runs quickly + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + # Patch the interval constants to make the test fast + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 0.1), + patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.05), + patch.object(_HostConfigTimer, "_ACCELERATE_THRESHOLD_S", 60), + ): + timer.start() 
+ # Wait enough for at least one interval to fire + time.sleep(0.3) + timer.stop() + + # Should have at least one info log + assert logger.info.call_count >= 1 + log_event = logger.info.call_args_list[0][0][0] + assert isinstance(log_event, WorkerHostConfigurationLogEvent) + assert "Host Config Time" in log_event.msg + assert "Elapsed:" in log_event.msg + assert "Remaining:" in log_event.msg + assert log_event.status == WorkerHostConfigurationStatus.RUNNING + + def test_emits_timeout_warning_when_timeout_reached( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Timer emits a WARNING log when the timeout is reached.""" + # Very short timeout so it expires quickly + timer = _HostConfigTimer( + timeout_seconds=0, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 0.05), + patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.05), + ): + timer.start() + time.sleep(0.2) + timer.stop() + + # Should have a warning about timeout reached + assert logger.warning.call_count >= 1 + warning_event = logger.warning.call_args_list[0][0][0] + assert isinstance(warning_event, WorkerHostConfigurationLogEvent) + assert "timeout expired" in warning_event.msg.lower() + assert "Worker may be terminated by the service" in warning_event.msg + assert warning_event.status == WorkerHostConfigurationStatus.FAILED + + def test_accelerates_interval_when_remaining_low( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Timer switches to accelerated interval when remaining time <= threshold.""" + # Timeout of 1s with threshold at 60s means we're always in accelerated mode + timer = _HostConfigTimer( + timeout_seconds=1, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 10), + 
patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.1), + patch.object(_HostConfigTimer, "_ACCELERATE_THRESHOLD_S", 60), + ): + timer.start() + # With normal interval of 10s, if we weren't accelerating we'd get 0 logs in 0.5s + # With accelerated interval of 0.1s, we should get at least one + time.sleep(0.5) + timer.stop() + + # Should have at least one log due to accelerated interval + assert logger.info.call_count >= 1 + + def test_thread_is_daemon( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Timer thread is a daemon thread so it won't prevent process exit.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + timer.start() + assert timer._thread is not None + assert timer._thread.daemon is True + timer.stop() + + def test_thread_name( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Timer thread has a descriptive name for debugging.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + timer.start() + assert timer._thread is not None + assert timer._thread.name == "host-config-timer" + timer.stop() + + def test_stop_is_idempotent( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Calling stop() multiple times does not raise.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + timer.start() + timer.stop() + # Second stop should not raise + timer.stop() + + def test_stop_without_start( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Calling stop() without start() does not raise.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + 
worker_id=worker_id, + ) + # Should not raise + timer.stop() + + def test_progress_log_contains_correct_ids( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """Progress logs contain the correct farm, fleet, and worker IDs.""" + timer = _HostConfigTimer( + timeout_seconds=300, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 0.05), + patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.05), + ): + timer.start() + time.sleep(0.15) + timer.stop() + + assert logger.info.call_count >= 1 + log_event = logger.info.call_args_list[0][0][0] + assert isinstance(log_event, WorkerHostConfigurationLogEvent) + assert log_event.farm_id == farm_id + assert log_event.fleet_id == fleet_id + assert log_event.worker_id == worker_id + + def test_thread_exits_after_timeout_warning( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """After emitting the timeout warning, the timer thread exits on its own.""" + timer = _HostConfigTimer( + timeout_seconds=0, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 0.05), + patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.05), + ): + timer.start() + time.sleep(0.3) + + # Thread should have exited on its own after the timeout warning + assert timer._thread is not None + assert not timer._thread.is_alive() + # Clean up + timer.stop() + + def test_no_further_logs_after_timeout_reached( + self, + logger: MagicMock, + farm_id: str, + fleet_id: str, + worker_id: str, + ) -> None: + """After the timeout is reached, the timer stops logging even if the worker is still + running (i.e. stop() has not been called). 
The timer thread self-exits after the + timeout warning and does not continue emitting progress logs.""" + timer = _HostConfigTimer( + timeout_seconds=0, + logger=logger, + farm_id=farm_id, + fleet_id=fleet_id, + worker_id=worker_id, + ) + + with ( + patch.object(_HostConfigTimer, "_NORMAL_INTERVAL_S", 0.05), + patch.object(_HostConfigTimer, "_ACCELERATED_INTERVAL_S", 0.05), + ): + timer.start() + # Wait for the timeout warning to fire and the thread to exit + time.sleep(0.3) + + # Record the log counts after the thread has self-exited + info_count_after_timeout = logger.info.call_count + warning_count_after_timeout = logger.warning.call_count + + # Wait more time — simulating the worker still running after timeout + time.sleep(0.3) + + # No additional logs should have been emitted + assert logger.info.call_count == info_count_after_timeout + assert logger.warning.call_count == warning_count_after_timeout + + # Exactly one warning (the timeout-reached message) + assert warning_count_after_timeout == 1 + + # Clean up + timer.stop()