[WIP] feat: add progress timer logging for host configuration script execution#904
Open
leongdl wants to merge 1 commit intoaws-deadline:mainlinefrom
Open
[WIP] feat: add progress timer logging for host configuration script execution#904leongdl wants to merge 1 commit intoaws-deadline:mainlinefrom
leongdl wants to merge 1 commit intoaws-deadline:mainlinefrom
Conversation
aeb5870 to
bf97267
Compare
Add a background daemon thread that periodically logs elapsed and remaining time while the host configuration script runs. Logs every 30s normally, accelerating to every 10s when <=60s remain. Emits a warning when the server-side timeout is reached. This improves observability for host configuration timeouts (issue aws-deadline#748), where the service backend kills the worker with no client-side indication of why the failure occurred. - Add _HostConfigTimer class in host_configuration_script.py - Integrate timer into HostConfigurationScriptRunner.run() - Add unit tests for timer behavior - Add design doc under dev-doc/ Signed-off-by: David Leong <116610336+leongdl@users.noreply.github.com>
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



What was the problem/requirement? (What/Why)
When a host configuration script times out, the service backend kills the worker externally. The agent logs show a generic
exit code: 1with no indication that the failure was caused by a timeout. Operators must manually correlate timestamps to figure out what happened. There is zero in-flight visibility into how long the script has been running or how much time remains before the service terminates the worker.Relates to #748.
What was the solution? (How)
Added a
_HostConfigTimerdaemon thread class inhost_configuration_script.pythat periodically logs elapsed and remaining time while the host configuration script executes. The timer:WorkerHostConfigurationLogEvent(status=RUNNING)Event.wait(timeout=interval)for immediate cancellation when the script finishesHostConfigurationScriptRunner.run()wrapping both posix and Windows execution pathsWe intentionally did not extend
ScriptRunnerBase._runtime_limitbecause that mechanism is for client-side process termination — host configuration timeout is enforced server-side, and our timer is purely observational.What is the impact of this change?
Operators will now see periodic countdown logs in both CloudWatch and local log files during host configuration execution. When a timeout occurs, the last few log lines will clearly show the countdown approaching zero with a warning, making it immediately obvious that the failure was timeout-related.
No behavioral change to the agent — the timer is logging-only and does not interact with process lifecycle.
How was this change tested?
test/unit/startup/test_host_config_timer.pycovering:Was this change documented?
Yes — design doc added at
dev-doc/host-config-timeout-progress-logging.mdcovering architecture, thread safety analysis, base class rationale, and edge cases.Is this a breaking change?
No.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
How did I implement this:
Here's the sequence of your prompts in this conversation:
Initial request — Read the GitHub issue #748 and host_configuration_script.py, propose adding a background timer thread that logs elapsed/remaining time every 30s (accelerating to 10s at ≤1 min), write a design doc under dev-doc/.
Read the base class — Also read the OpenJD ScriptRunnerBase base class for better insight, and consider whether the logger is thread safe.
Clarification on timeout enforcement — "Host config timeout is enforced by the service backend to kill the worker, it is not enforced on the clientside. It is why we need more logs."
Wording fix — "When timeout is reached: -> Worker maybe terminating by the service"
Read logging.md — "Read also logging.md, anything else to update?"
Add summary section — "Put a summary after the problem. We are adding a timer thread, to print a log. We are not extending the base class because. And the logging prints a standard structured message like current logging.md."
Edge case clarification — "What if the system sends a sig term, the worker still dies right" (re: daemon thread cleanup)
Implement and test — "Implement it, then activate the worker agent skill to build and run unit test. Don't forget to add unit tests."
Git workflow — "git add the code changes, make a semantic commit title and then git push origin with set upstream for me to review"
Run CI checks — "run hatch run fmt, hatch run lint and hatch run test, fix any issues"
Stage changes — "stage both src and test code changes"
Fix CI flake — Pasted the CI failure for test_accelerates_interval_when_remaining_low (timing-sensitive assertion assert 1 >= 2).
PR description — "Fill this in properly and give me a non rendered markdown to copy" (the PR template).
This prompt — "What were all my prompts to implement this"