@cronik commented Jan 7, 2026

This change updates the `ContainerExecProc#kill` method to force the `finished` countdown latch to decrement. On some high-load clusters we have observed the `joinWithTimeout` timeout being reached while the proc remains blocked.

When `joinWithTimeout` is called and the task does not complete in time, it calls the `kill` method:

https://github.com/jenkinsci/jenkins/blob/368f1ccbc967a85c0ff801f3729cb77a269afd41/core/src/main/java/hudson/Proc.java#L165
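The sketch below makes the timeout path concrete. It is loosely modeled on the linked `hudson.Proc#joinWithTimeout` and is not a copy of it. `SketchProc` and the daemon watchdog pool are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the timeout-then-kill pattern (not Jenkins code).
abstract class SketchProc {
    // Daemon threads so a stuck watchdog never keeps the JVM alive.
    private static final ExecutorService WATCHDOG = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "join-timeout-watchdog");
        t.setDaemon(true);
        return t;
    });

    abstract int join() throws InterruptedException;

    abstract void kill();

    final int joinWithTimeout(long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch joined = new CountDownLatch(1);
        WATCHDOG.submit(() -> {
            try {
                // If join() has not returned within the timeout, ask the process to die.
                if (!joined.await(timeout, unit)) {
                    kill();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            // Returns only when join() returns, which after a timeout depends on kill().
            return join();
        } finally {
            joined.countDown(); // stop the watchdog if join() finished first
        }
    }
}
```

The watchdog only calls `kill()`. `joinWithTimeout` itself still returns only when `join()` does, so a `kill()` that never releases `join()` leaves the caller blocked. This matches the `Proc.joinWithTimeout` → `ContainerExecProc.join` frames in the thread dump.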

But if `kill` fails to count down the `finished` latch, the `join` method keeps waiting indefinitely.
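Here is a minimal sketch of that failure mode. It assumes, hypothetically, that `finished` is counted down only from an exec-listener callback. `ListenerOnlyProc`, `onExit`, `listenerFires` and `joinFor` are illustrative names, not the plugin's API. `joinFor` uses a bounded wait only so the example terminates; the real `join` waits with no timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the pre-fix behaviour: only the listener releases the latch.
class ListenerOnlyProc {
    final CountDownLatch finished = new CountDownLatch(1);
    private final boolean listenerFires;

    ListenerOnlyProc(boolean listenerFires) {
        this.listenerFires = listenerFires;
    }

    // Exec-listener callback: the only place the latch is counted down.
    void onExit() {
        finished.countDown();
    }

    // Pre-fix kill(): send Ctrl-C, close, and rely on the listener being notified.
    void kill() {
        if (listenerFires) {
            onExit(); // remote command exited and the listener was called
        }
        // otherwise nothing ever counts down `finished`
    }

    // Bounded stand-in for join(); returns false if the latch was never released.
    boolean joinFor(long millis) throws InterruptedException {
        return finished.await(millis, TimeUnit.MILLISECONDS);
    }
}
```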

By forcing `finished.countDown()` after `close`, `join` should be unblocked even if the Ctrl-C command didn't trigger the exec listener. `countDown` is a no-op if the latch count is already zero.
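Here is the proposed fix under the same hypothetical model: `kill` releases the latch itself after closing, so `join` no longer depends on the listener firing. `ForcedCountDownProc` and its `close` and `onExit` methods are illustrative stand-ins, not the plugin's code.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the post-fix behaviour.
class ForcedCountDownProc {
    final CountDownLatch finished = new CountDownLatch(1);

    // Exec-listener callback (illustrative name).
    void onExit() {
        finished.countDown();
    }

    // Post-fix kill(): always release the latch after closing.
    void kill() {
        try {
            close();
        } finally {
            // Unblocks join() even if onExit() never runs; no-op if the count is already zero.
            finished.countDown();
        }
    }

    // Hypothetical stand-in for sending Ctrl-C and closing the exec stream.
    private void close() {
    }

    // Waits on the latch with no timeout, like the blocking join() in the thread dump.
    int join() throws InterruptedException {
        finished.await();
        return -1; // illustrative exit code for a killed process
    }
}
```

The `finally` is a choice made in this sketch so that the latch is released even if `close` throws. The description above says only that the count-down happens after `close`. Because `CountDownLatch#countDown` does nothing once the count is zero, the extra call is safe when the listener has already fired.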

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

`ContainerExecProc#join`, which waits on the `finished` latch: https://github.com/jenkinsci/kubernetes-plugin/blob/676ab933d12ad8b25e4d7f78594a32066aad2569/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerExecProc.java#L100
@cronik requested a review from a team as a code owner January 7, 2026 02:38

cronik commented Jan 10, 2026

Thread dump of the hung thread, blocked in `ContainerExecProc.join` waiting on the latch:

```
"org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution [#536]" Id=362583 Group=main WAITING on java.util.concurrent.CountDownLatch$Sync@3621b128
    at java.base/jdk.internal.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.CountDownLatch$Sync@3621b128
    at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1047)
    at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecProc.join(ContainerExecProc.java:100)
    at hudson.Proc.joinWithTimeout(Proc.java:172)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.setDefaultRunAsUser(EphemeralContainerStepExecution.java:428)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.startEphemeralContainer(EphemeralContainerStepExecution.java:184)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution.startEphemeralContainerWithRetry(EphemeralContainerStepExecution.java:112)
    at PluginClassLoader for kubernetes//org.csanchez.jenkins.plugins.kubernetes.pipeline.EphemeralContainerStepExecution$$Lambda$2125/0x0000000801d38b50.run(Unknown Source)
    at PluginClassLoader for workflow-step-api//org.jenkinsci.plugins.workflow.steps.GeneralNonBlockingStepExecution.lambda$run$0(GeneralNonBlockingStepExecution.java:77)
    at PluginClassLoader for workflow-step-api//org.jenkinsci.plugins.workflow.steps.GeneralNonBlockingStepExecution$$Lambda$2121/0x00000008013ae3b0.run(Unknown Source)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@53a0676c
```
