Fix problem that Nextflow Executions stuck on Kubernetes #5501

Open
wants to merge 3 commits into
base: master

Conversation

Lehmann-Fabian
Contributor

When running Nextflow workflows on Kubernetes, executions would often get stuck: Pods remained in the Succeeded state without being removed, and the workflow itself did not progress.

After running a jstack on the Nextflow Java process, I identified a thread hanging at this line in TraceRecord.groovy:
TraceRecord.groovy#L422.

To resolve this, I replaced the Groovy methods with native Java functions. After using this change in production for over six months, the problem has not recurred, so I am contributing it with confidence that it will benefit other Nextflow and Kubernetes users.
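For readers who want a picture of what "native Java functions" can look like here, the following is a minimal sketch and not the code from this PR. It assumes the trace file consists of `key=value` lines; the class name `TraceFileSketch`, the method `readKeyValues`, and the sample keys are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; names and file layout are assumptions.
final class TraceFileSketch {

    // Reads a key=value file one line at a time using only java.io / java.nio,
    // so at most one line is held in memory besides the resulting map.
    static Map<String, String> readKeyValues(Path file) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf('=');
                if (sep <= 0) {
                    continue; // skip blank lines, headers, and lines without a key
                }
                String key = line.substring(0, sep).trim();
                String value = line.substring(sep + 1).trim();
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("trace-sketch", ".txt");
        try {
            // Sample content with made-up keys, only for demonstration.
            Files.writeString(tmp, "header line\nrealtime=1200\npeak_rss=2048\n");
            System.out.println(readKeyValues(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The try-with-resources block closes the reader even if a read fails, and the loop never buffers the whole file.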

We experienced this issue on three different clusters, all running Ceph with Kubernetes.

…ere Nextflow got stuck reading the trace File.

Signed-off-by: Lehmann_Fabian <[email protected]>

netlify bot commented Nov 13, 2024

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit c2faac1
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6734cd1d2d9e6c00085ae062

@bentsherman
Member

That line should be delegating to this method

It is slightly different from your approach; maybe it is getting into an infinite loop.

@Lehmann-Fabian
Contributor Author

Yes, and the BufferedReader is created by Files.newBufferedReader(self, Charset.defaultCharset()).
The jstack trace is always in this line: IOGroovyMethods.java#L862
An infinite loop seems unlikely, as it would cause the answer StringBuilder to fill until OOM. More likely, there is a timeout in the distributed file system that is not caught.
Anyway, the new implementation does not have the problem and is even a little bit more memory efficient, as it reads the trace file line by line ;)
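To make the reasoning about the answer StringBuilder concrete, here is a simplified approximation of a "read everything" helper. It is a sketch under assumptions, not the Groovy source: the class `WholeFileReadSketch`, the method `readAll`, and the buffer size are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration of a whole-file read; not the Groovy implementation.
final class WholeFileReadSketch {

    // Appends fixed-size chunks to a single StringBuilder until read()
    // reports end of stream (-1).
    static String readAll(Reader reader) throws IOException {
        StringBuilder answer = new StringBuilder();
        char[] buffer = new char[8192];
        int n;
        // If read() kept returning data forever, `answer` would keep growing
        // until the JVM ran out of memory. If read() blocks instead, the
        // thread waits inside this call and memory use stays flat.
        while ((n = reader.read(buffer)) != -1) {
            answer.append(buffer, 0, n);
        }
        return answer.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input with a made-up key, only for demonstration.
        String text = readAll(new StringReader("realtime=1200\n"));
        System.out.println(text.length());
    }
}
```

A thread dump that always shows the same frame inside such a loop, with no OOM, is consistent with a blocked `read()` rather than an endless loop.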


2 participants