This repository has been archived by the owner on Jan 8, 2024. It is now read-only.
Describe the bug
I have a runner installed on AWS ECS using waypoint runner install, pointing to the prod HCP Waypoint server.
Currently, every remote operation behaves like this:
$ wp deploy
» Deploying acmeapp1...
» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
If you interrupt this command, the job will still run in the background.
According to waypoint job list, we're waiting for the static runner to take the StartTask job.
Here are the runner's most recent logs, according to CloudWatch:
| Timestamp | Log line |
| -- | -- |
| 2023-01-25T18:16:57.441-05:00 | 2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
It's currently 2023-01-27, so it looks like the HCP server went down briefly on 2023-01-26T18:11:00.128-05:00, and it caused the runner to become stuck.
I've tcping'd the runner's health check port 1234, and it's still open.
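For anyone who wants to repeat that check without tcping, here is a minimal Go sketch of a TCP probe. The names portOpen and probeLocalListener are made up for this example, and the demo probes a throwaway local listener rather than the real runner. Note that the demo listener never calls Accept, yet the connection still succeeds, so an open port only shows that the socket is bound; it says nothing about whether the job-accept loop is making progress.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// portOpen reports whether a TCP connection to addr succeeds within timeout.
func portOpen(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// probeLocalListener binds a throwaway listener on a free local port,
// probes it, closes it, and probes again. The listener never calls Accept.
func probeLocalListener() (whileListening, afterClose bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false
	}
	addr := ln.Addr().String()
	whileListening = portOpen(addr, time.Second)
	ln.Close()
	afterClose = portOpen(addr, time.Second)
	return whileListening, afterClose
}

func main() {
	up, down := probeLocalListener()
	fmt.Println("open while listening:", up)
	fmt.Println("open after close:", down)
}
```

To probe the runner itself, call portOpen with the task's address and the health check port (1234 in my setup).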
If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.
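As a rough illustration of the kind of logging I mean, here is a generic Go sketch that logs before and after a potentially blocking call and stops waiting after a timeout. It is not the Waypoint code: runStep and errStepTimeout are hypothetical names, and the log prefixes only imitate the runner's format. Real code would also pass a context into the call so the blocked work can actually be cancelled.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errStepTimeout is returned when a step does not finish in time.
var errStepTimeout = errors.New("step timed out")

// runStep logs before and after fn, and stops waiting once timeout elapses.
// It does not cancel fn; it only makes a hang visible in the logs.
func runStep(name string, timeout time.Duration, fn func() error) error {
	log.Printf("[DEBUG] starting step: %s", name)

	// Buffered so the goroutine can still deliver its result and exit
	// if nobody is waiting any more.
	done := make(chan error, 1)
	go func() { done <- fn() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		log.Printf("[DEBUG] finished step: %s err=%v", name, err)
		return err
	case <-timer.C:
		log.Printf("[WARN] step still blocked after %s: %s", timeout, name)
		return errStepTimeout
	}
}

func main() {
	never := make(chan struct{}) // never closed, so the step below hangs
	err := runStep("reconnect", 100*time.Millisecond, func() error {
		<-never
		return nil
	})
	log.Printf("result: %v", err)
}
```

With something like this around each suspect call, the last "starting step" line without a matching "finished step" line would point at where the runner is stuck.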
Workaround
Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.
NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with waypoint job cancel first.
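The manual restart could in principle be automated with a watchdog inside the process: if the accept loop stops reporting progress, exit and let ECS replace the task. The Go sketch below is only an assumption about how that might look; the watchdog type and its methods are hypothetical and not part of Waypoint. What counts as a heartbeat matters, because a runner that is idle and waiting for jobs is healthy, so the beat would have to come from the loop or stream itself rather than from accepted jobs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// watchdog tracks the last time the monitored loop reported progress.
// Hypothetical helper, not part of Waypoint.
type watchdog struct {
	mu   sync.Mutex
	last time.Time     // time of the most recent heartbeat
	max  time.Duration // longest allowed gap between heartbeats
}

// newWatchdog starts the clock at the current time.
func newWatchdog(max time.Duration) *watchdog {
	return &watchdog{last: time.Now(), max: max}
}

// Beat records a heartbeat at time now.
func (w *watchdog) Beat(now time.Time) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.last = now
}

// Stalled reports whether more than max has passed since the last heartbeat.
func (w *watchdog) Stalled(now time.Time) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return now.Sub(w.last) > w.max
}

func main() {
	w := newWatchdog(50 * time.Millisecond)
	w.Beat(time.Now())
	fmt.Println("stalled right after a beat:", w.Stalled(time.Now()))

	time.Sleep(100 * time.Millisecond)
	// A real process would log and call os.Exit(1) here so that ECS
	// starts a replacement task.
	fmt.Println("stalled after a silent gap:", w.Stalled(time.Now()))
}
```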
Steps to Reproduce
1. Run a static runner on ECS
2. Wait for an eventual hang
Expected behavior
Waypoint runner should not hang
Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:
Waypoint CLI Version: 0.10.5
Waypoint Server Platform and Version: (like docker, nomad, kubernetes): HCP
Additional context
If anyone else sees this, add a 👍
| Timestamp | Log line |
| -- | -- |
| 2023-01-25T16:46:23.421-06:00 | 2023-01-25T22:46:23.421Z [DEBUG] waypoint.runner.agent.runner: sending job completion: job_id=01GQNHP22S644CP4GGJTGAAC52 job_op=*gen.Job_StopTask |
| 2023-01-25T16:46:23.450-06:00 | 2023-01-25T22:46:23.450Z [DEBUG] waypoint.runner.agent.runner: opening job stream: retry=false |
| 2023-01-25T16:46:23.450-06:00 | 2023-01-25T22:46:23.450Z [INFO] waypoint.runner.agent.runner: waiting for job assignment |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
I'd like to get in there and take a thread dump, but it looks like enabling exec on AWS ECS is non-trivial, and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267
My money is on here: waypoint/internal/runner/accept.go, line 206 in 5d8a671
Or here: waypoint/internal/runner/accept.go, line 221 in 5d8a671