-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix counter leak when Jenkins restarts with ephemeral templates #2787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix counter leak when Jenkins restarts with ephemeral templates #2787
Conversation
Fixes jenkinsci#2783 Partially addresses jenkinsci#2737 The problem: When you create pod templates in pipelines (ephemeral templates), their reference becomes null after Jenkins restarts. The cleanup code was checking 'if (template != null)' before decrementing counters, so it never ran - causing a permanent leak. The fix: Use cloudName and templateId instead of the template object. These fields survive restarts, so counters get cleaned up correctly.
b699483 to
97ad743
Compare
Prevents attempting to decrement counters for nodes that were never registered (e.g., in eviction scenarios). Checks if counters exist before decrementing to avoid spurious warnings.
apuig
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for your work on this, this PR seems to be going in the right direction
| int currentGlobalCount = getGlobalCount(cloudName); | ||
| int currentPodTemplateCount = getPodTemplateCount(podTemplateId); | ||
|
|
||
| if (currentGlobalCount > 0 || currentPodTemplateCount > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please consider decrement each counter independently, like different template but same cloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right - we should check each counter independently.
I'll update this to:
- Check and decrement global count if > 0
- Check and decrement pod template count if > 0 separately
Will push the fix shortly.
| // Add the slave to Jenkins | ||
| j.jenkins.addNode(slave); | ||
|
|
||
| // Remove the node (simulates agent deletion) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cloud.removeTemplate(ephemeralTemplate) to simulate the behavior with idle config after build completion (?) . Perhaps a RestartableJenkinsRule could better represents the original issue report
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I'll look into using RestartableJenkinsRule
for a better test. Would you prefer I do that in this PR or as a
follow-up improvement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling cloud.removeTemplate seems sufficient to me. I typically begin by writing the test before applying the fix, so if I recall correctly, this test should now fail without your modifications in src/main
| } | ||
| podTemplateCounts.put(podTemplateId, Math.max(0, newPodTemplateCount)); | ||
| LOGGER.log( | ||
| Level.FINEST, () -> podTemplateId + " template limit: " + Math.max(0, newPodTemplateCount)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: unregister function also logs the current cloud.getContainerCap() to put in context this info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on the logging consistency. I'll add the containerCap to the
log message to match the other unregister methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see changes on that front, is this expected ?
- Restructure unregisterByIds() to check and decrement counters independently - Improve test to simulate ephemeral template removal with cloud.removeTemplate() - Keep logging simple for consistency with existing patterns
|
Hey @apuig! I've pushed the changes from your feedback (commit 2728ae9): -Split the counter checks into separate if blocks so each decrements independently |
|
LGTM so far. I just need to test it manually on a running instance, I’ll try to do that before the end of the week. |
Fixes #2783
Partially addresses #2737
What's the bug?
When you create pod templates inside pipelines (ephemeral templates), Jenkins loses track of them after a restart. The cleanup code has this check:
PodTemplate template = kubernetesNode.getTemplateOrNull();
if (template != null) {
instance.unregister(...);
}
Problem is, after restart the template is null (it's a transient field). So the cleanup never happens and the counter stays incremented forever. Eventually Jenkins thinks it's at capacity even when no pods are running.
How I fixed it
Instead of relying on the template object, I'm now using cloudName and templateId which are always available (they're persistent fields). Added a new method
unregisterByIds()that just needs these IDs instead of the whole template object.The NodeListener now calls this new method without checking if template is null.
What changed
unregisterByIds()in KubernetesProvisioningLimitsTesting
Ran
mvn clean compile- everything builds fine. Checked the logic manually and it makes sense - cloudName and templateId are definitely persistent, so they'll survive restarts.The test I added compiles but I'm not 100% sure it's set up right for the Jenkins test environment. Would appreciate a look at that.
FYI
I'm also working on PR #2785 which fixes a different counter leak issue (ImagePullBackOff detection). Both are part of debugging issue #2737. Happy to wait for that one to merge first if you prefer - just wanted to get this out there since I had it working.
Thanks for taking a look!
cc @Vlatombe @jglick - would appreciate your review when you have time!
Submitter checklist