[WIP] feat: add progress timer logging for host configuration script execution#904

Open
leongdl wants to merge 1 commit into aws-deadline:mainline from leongdl:hc-tick
Conversation

Contributor

@leongdl leongdl commented Apr 3, 2026

What was the problem/requirement? (What/Why)

When a host configuration script times out, the service backend kills the worker externally. The agent logs show only a generic exit code: 1 with no indication that the failure was caused by a timeout, so operators must manually correlate timestamps to figure out what happened. There is no in-flight visibility into how long the script has been running or how much time remains before the service terminates the worker.

Relates to #748.

What was the solution? (How)

Added a _HostConfigTimer daemon thread class in host_configuration_script.py that periodically logs elapsed and remaining time while the host configuration script executes. The timer:

  • Logs every 30s at INFO level using WorkerHostConfigurationLogEvent (status=RUNNING)
  • Accelerates to every 10s when ≤60s remain
  • Emits a WARNING when the timeout is reached: "Worker may be terminated by the service"
  • Uses Event.wait(timeout=interval) for immediate cancellation when the script finishes
  • Is started/stopped in HostConfigurationScriptRunner.run(), wrapping both the POSIX and Windows execution paths
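The mechanics described above can be sketched roughly as follows. This is an illustrative, self-contained sketch assuming the intervals stated in this PR; the class name, log wording, and structure are not the actual _HostConfigTimer implementation:

```python
import logging
import threading
import time

_logger = logging.getLogger(__name__)


class HostConfigTimer:
    """Illustrative sketch of the progress timer, not the agent's actual code."""

    def __init__(self, timeout_seconds: float, interval: float = 30.0, fast_interval: float = 10.0) -> None:
        self._timeout = timeout_seconds
        self._interval = interval
        self._fast_interval = fast_interval
        self._start_time = 0.0
        self._stop_event = threading.Event()
        # Daemon thread: never blocks agent shutdown (e.g. on SIGTERM).
        self._thread = threading.Thread(target=self._run, name="host-config-timer", daemon=True)

    def start(self) -> None:
        self._start_time = time.monotonic()
        self._thread.start()

    def stop(self) -> None:
        # Idempotent, and safe to call even if start() was never called.
        self._stop_event.set()

    def _run(self) -> None:
        while True:
            remaining = self._timeout - (time.monotonic() - self._start_time)
            if remaining <= 0:
                # Purely observational: warn, then self-exit. The service,
                # not this thread, terminates the worker.
                _logger.warning("Host configuration timeout reached; worker may be terminated by the service")
                return
            # Accelerate the cadence when 60s or less remain.
            interval = self._fast_interval if remaining <= 60.0 else self._interval
            # Event.wait() returns True immediately once stop() is called,
            # so a finishing script cancels the timer without delay.
            if self._stop_event.wait(timeout=min(interval, remaining)):
                return
            elapsed = time.monotonic() - self._start_time
            _logger.info(
                "Host configuration script running: %.0fs elapsed, %.0fs remaining",
                elapsed,
                max(0.0, self._timeout - elapsed),
            )
```

The key design point is that Event.wait(timeout=...) doubles as both the sleep and the cancellation channel, so no polling flag or extra lock is needed.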

We intentionally did not extend ScriptRunnerBase._runtime_limit because that mechanism is for client-side process termination — host configuration timeout is enforced server-side, and our timer is purely observational.
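Because the timer is observational only, the run() integration reduces to a start/stop pair around script execution. A minimal self-contained sketch of that shape (function names here are hypothetical, not the agent's API):

```python
import threading
from typing import Callable


def run_with_progress_timer(
    execute_script: Callable[[], int],
    log_progress: Callable[[float, float], None],
    timeout_seconds: float,
    interval: float = 30.0,
) -> int:
    """Hypothetical shape of the run() wrapping: the timer logs progress
    but never touches the script's process lifecycle."""
    cancel = threading.Event()

    def tick() -> None:
        elapsed = 0.0
        # wait() returns True as soon as cancel is set, ending the loop early.
        while elapsed + interval <= timeout_seconds and not cancel.wait(timeout=interval):
            elapsed += interval
            log_progress(elapsed, timeout_seconds - elapsed)

    timer = threading.Thread(target=tick, name="host-config-timer", daemon=True)
    timer.start()
    try:
        return execute_script()  # covers both the POSIX and Windows paths
    finally:
        cancel.set()  # immediate cancellation; the daemon thread exits at once
```

The try/finally guarantees the timer stops even when the script raises, which is why no lifecycle hooks from ScriptRunnerBase are needed.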

What is the impact of this change?

Operators will now see periodic countdown logs in both CloudWatch and local log files during host configuration execution. When a timeout occurs, the last few log lines will clearly show the countdown approaching zero with a warning, making it immediately obvious that the failure was timeout-related.

No behavioral change to the agent — the timer is logging-only and does not interact with process lifecycle.

How was this change tested?

  • 10 unit tests in test/unit/startup/test_host_config_timer.py covering:
    • No logs emitted when script finishes before first interval
    • Progress log emitted after interval elapses
    • Timeout warning emitted when timeout reached
    • Interval acceleration when remaining time is low
    • Daemon thread properties (daemon=True, named "host-config-timer")
    • stop() is idempotent and safe without start()
    • Correct farm/fleet/worker IDs in log events
    • Thread self-exits after timeout warning
  • Full test suite: 2693 passed, 0 failed
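One way to keep cases like the interval-acceleration test deterministic (the CI flake mentioned later in this thread was a timing-sensitive assertion) is to factor the interval choice into a pure function that tests can exercise without real sleeps. A sketch with hypothetical names, not the actual test code:

```python
def next_interval(remaining_seconds: float, normal: float = 30.0, fast: float = 10.0) -> float:
    """Pick the next logging interval: accelerate when 60s or less remain,
    and never wait past the deadline. Hypothetical helper, not agent code."""
    interval = fast if remaining_seconds <= 60.0 else normal
    return max(0.0, min(interval, remaining_seconds))


# Deterministic checks with no real sleeping:
assert next_interval(300.0) == 30.0  # normal cadence
assert next_interval(60.0) == 10.0   # accelerated at the 60s boundary
assert next_interval(5.0) == 5.0     # clamped to remaining time
assert next_interval(-1.0) == 0.0    # past deadline: no further wait
```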

Was this change documented?

Yes — design doc added at dev-doc/host-config-timeout-progress-logging.md covering architecture, thread safety analysis, base class rationale, and edge cases.

Is this a breaking change?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

How did I implement this:

Here's the sequence of prompts in this conversation:

Initial request — Read the GitHub issue #748 and host_configuration_script.py, propose adding a background timer thread that logs elapsed/remaining time every 30s (accelerating to 10s at ≤1 min), write a design doc under dev-doc/.

Read the base class — Also read the OpenJD ScriptRunnerBase base class for better insight, and consider whether the logger is thread safe.

Clarification on timeout enforcement — "Host config timeout is enforced by the service backend to kill the worker, it is not enforced on the clientside. It is why we need more logs."

Wording fix — "When timeout is reached: -> Worker maybe terminating by the service"

Read logging.md — "Read also logging.md, anything else to update?"

Add summary section — "Put a summary after the problem. We are adding a timer thread, to print a log. We are not extending the base class because. And the logging prints a standard structured message like current logging.md."

Edge case clarification — "What if the system sends a sig term, the worker still dies right" (re: daemon thread cleanup)

Implement and test — "Implement it, then activate the worker agent skill to build and run unit test. Don't forget to add unit tests."

Git workflow — "git add the code changes, make a semantic commit title and then git push origin with set upstream for me to review"

Run CI checks — "run hatch run fmt, hatch run lint and hatch run test, fix any issues"

Stage changes — "stage both src and test code changes"

Fix CI flake — Pasted the CI failure for test_accelerates_interval_when_remaining_low (timing-sensitive assertion assert 1 >= 2).

PR description — "Fill this in properly and give me a non rendered markdown to copy" (the PR template).

This prompt — "What were all my prompts to implement this"

@github-actions github-actions bot added the waiting-on-maintainers Waiting on the maintainers to review. label Apr 3, 2026
@leongdl leongdl force-pushed the hc-tick branch 2 times, most recently from aeb5870 to bf97267 on April 3, 2026 02:54
@leongdl leongdl marked this pull request as ready for review April 3, 2026 03:09
@leongdl leongdl requested a review from a team as a code owner April 3, 2026 03:09
Add a background daemon thread that periodically logs elapsed and remaining
time while the host configuration script runs. Logs every 30s normally,
accelerating to every 10s when <=60s remain. Emits a warning when the
server-side timeout is reached.

This improves observability for host configuration timeouts (issue aws-deadline#748),
where the service backend kills the worker with no client-side indication
of why the failure occurred.

- Add _HostConfigTimer class in host_configuration_script.py
- Integrate timer into HostConfigurationScriptRunner.run()
- Add unit tests for timer behavior
- Add design doc under dev-doc/

Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>

sonarqubecloud bot commented Apr 3, 2026
