Static runner hanging on aws ecs #4460

izaaklauer · 2023-01-27T23:29:54Z

Describe the bug
I have a static runner installed on AWS ECS using waypoint runner install, pointing to the prod HCP Waypoint server.

Currently, every remote operation behaves like this:

$ wp deploy

» Deploying acmeapp1...

» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
  If you interrupt this command, the job will still run in the background.

According to waypoint job list, we're waiting for the static runner to take the StartTask job.

Here are the runner's most recent logs, according to cloudwatch:



| CloudWatch timestamp | Message |
| -- | -- |
| 2023-01-25T18:16:57.441-05:00 | 2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-26T18:11:00.128-05:00 | 2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |

It's currently 2023-01-27 and those are the last lines the runner logged, so it looks like the HCP server went down briefly at 2023-01-26T18:11:00.128-05:00 and the runner has been stuck ever since.

I've tcping'd the runner's health check port 1234, and it's still open.

I'd like to get in there and take a thread dump, but it looks like enabling exec on aws ecs is non-trivial, and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
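Since the runner is a Go program, the "thread dump" here would be a goroutine dump. As a minimal sketch, assuming the runner could be patched to write its goroutine stacks to stderr (for example from a watchdog), the stacks would show up in the task's logs without needing ECS exec. `goroutineDump` is a hypothetical helper for illustration, not an existing Waypoint function.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"strings"
)

// goroutineDump returns the stacks of all live goroutines as text.
// Hypothetical helper for illustration; it is not part of Waypoint.
func goroutineDump() (string, error) {
	var sb strings.Builder
	// debug=2 prints stacks in the same format as an unrecovered panic.
	if err := pprof.Lookup("goroutine").WriteTo(&sb, 2); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
		os.Exit(1)
	}
	// Assumption: the task's stderr is shipped to CloudWatch by its log driver.
	fmt.Fprint(os.Stderr, dump)
}
```

A goroutine blocked on a mutex or a condition variable would appear in this output with its full call stack, which is what would confirm or rule out the two suspects in accept.go.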

Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267

My money is on here (waypoint/internal/runner/accept.go, line 206 in 5d8a671):

    streamCtxLock.Lock()

Or here (waypoint/internal/runner/accept.go, line 221 in 5d8a671):

    if r.waitStateGreater(&r.stateConfig, stateGen) {

If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.
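One way the extra logging could look, as a sketch only: wrap the blocking call so that a warning is emitted if it has not returned after a while. `lockWithWarning` and `runDemo` are hypothetical names, the standard `log` package stands in for whatever logger the runner uses, and none of this is taken from the actual accept.go.

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// lockWithWarning acquires mu like mu.Lock(), but calls warn at most once
// if the lock is still not held after warnAfter.
// Hypothetical helper for illustration; it is not Waypoint code.
func lockWithWarning(mu *sync.Mutex, name string, warnAfter time.Duration, warn func(msg string)) {
	timer := time.AfterFunc(warnAfter, func() {
		warn(fmt.Sprintf("still waiting for %s after %s", name, warnAfter))
	})
	defer timer.Stop() // cancels the warning if the lock was acquired in time
	mu.Lock()
}

// runDemo holds a mutex for hold, then reports whether a second locker
// configured with warnAfter emitted a warning while it was blocked.
func runDemo(hold, warnAfter time.Duration) bool {
	var mu sync.Mutex
	var warned atomic.Bool

	mu.Lock()
	go func() {
		time.Sleep(hold)
		mu.Unlock()
	}()

	lockWithWarning(&mu, "streamCtxLock", warnAfter, func(msg string) {
		warned.Store(true)
		log.Print(msg)
	})
	mu.Unlock()
	return warned.Load()
}

func main() {
	fmt.Println("warned while blocked:", runDemo(200*time.Millisecond, 20*time.Millisecond))
	fmt.Println("warned when uncontended:", runDemo(0, time.Second))
}
```

The same timer pattern could wrap the `waitStateGreater` call. A warning line that is never followed by progress would then point at the exact call that is stuck.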

Workaround

Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.

NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with waypoint job cancel first.

Steps to Reproduce

1. Run a static runner on ECS
2. Wait for an eventual hang

Expected behavior
Waypoint runner should not hang

Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:

Waypoint CLI Version: 0.10.5
Waypoint Server Platform and Version: (like docker, nomad, kubernetes): HCP

Additional context
If anyone else sees this, add a 👍


cicoyle · 2023-02-02T15:10:33Z

Saw this again in ECS:



| CloudWatch timestamp | Message |
| -- | -- |
| 2023-01-25T16:46:23.421-06:00 | 2023-01-25T22:46:23.421Z [DEBUG] waypoint.runner.agent.runner: sending job completion: job_id=01GQNHP22S644CP4GGJTGAAC52 job_op=*gen.Job_StopTask |
| 2023-01-25T16:46:23.450-06:00 | 2023-01-25T22:46:23.450Z [DEBUG] waypoint.runner.agent.runner: opening job stream: retry=false |
| 2023-01-25T16:46:23.450-06:00 | 2023-01-25T22:46:23.450Z [INFO] waypoint.runner.agent.runner: waiting for job assignment |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |
| 2023-01-27T01:43:00.039-06:00 | 2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect |

It also looks like there is a zombie ODR (on-demand runner) task.

izaaklauer added the new, jira, plugin/ecs, bug, and intermittent labels and removed the new label on Jan 27, 2023

izaaklauer linked a pull request on Jul 21, 2023 that will close this issue: Not interpreting server NotFound error as server down #4854 (Draft)
