@Abhijeet212004

Fixes #2783
Partially addresses #2737

What's the bug?

When you create pod templates inside pipelines (ephemeral templates), Jenkins loses track of them after a restart. The cleanup code has this check:

PodTemplate template = kubernetesNode.getTemplateOrNull();
if (template != null) {
    instance.unregister(...);
}

The problem is that after a restart the template is null, because it's a transient field and is not persisted with the node. So the cleanup never happens and the counter stays incremented forever. Eventually Jenkins thinks it's at capacity even when no pods are running.
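
To make the failure mode concrete, here is a minimal sketch of how a transient field is lost while ordinary fields survive a save and reload. It assumes plain Java serialization as a stand-in for how Jenkins persists nodes, and `DemoNode` and `TransientFieldDemo` are hypothetical classes, not the plugin's real ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for an agent node; not the plugin's real class.
class DemoNode implements Serializable {
    private static final long serialVersionUID = 1L;

    final String cloudName;     // persisted
    final String templateId;    // persisted
    transient Object template;  // skipped when the node is written out

    DemoNode(String cloudName, String templateId, Object template) {
        this.cloudName = cloudName;
        this.templateId = templateId;
        this.template = template;
    }

    // Serialize and deserialize in memory, standing in for "save, restart, load".
    DemoNode roundTrip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DemoNode) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

class TransientFieldDemo {
    public static void main(String[] args) {
        DemoNode before = new DemoNode("kubernetes", "ephemeral-1", new Object());
        DemoNode after = before.roundTrip();
        // The IDs come back intact; the transient template reference comes back null.
        System.out.println(after.cloudName + " " + after.templateId + " " + after.template);
    }
}
```

After the round trip only the two ID strings are left to identify which counters the node was holding.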

How I fixed it

Instead of relying on the template object, I'm now using cloudName and templateId which are always available (they're persistent fields). Added a new method unregisterByIds() that just needs these IDs instead of the whole template object.

The NodeListener now calls this new method without checking if template is null.
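
A rough sketch of the difference between the two cleanup paths, assuming a much-simplified counter store. `DemoLimits`, `register` and `unregisterIfTemplatePresent` are hypothetical names for illustration; only `unregisterByIds` is named in this PR, and the real method in KubernetesProvisioningLimits may look different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified counter store; not the plugin's real class.
class DemoLimits {
    private final Map<String, Integer> globalCounts = new HashMap<>();
    private final Map<String, Integer> podTemplateCounts = new HashMap<>();

    synchronized void register(String cloudName, String templateId) {
        globalCounts.merge(cloudName, 1, Integer::sum);
        podTemplateCounts.merge(templateId, 1, Integer::sum);
    }

    // Old-style cleanup: silently skipped when the template reference is null.
    synchronized void unregisterIfTemplatePresent(String cloudName, String templateId, Object template) {
        if (template != null) {
            unregisterByIds(cloudName, templateId);
        }
    }

    // ID-based cleanup: needs only the two persisted strings.
    synchronized void unregisterByIds(String cloudName, String templateId) {
        globalCounts.computeIfPresent(cloudName, (k, v) -> Math.max(0, v - 1));
        podTemplateCounts.computeIfPresent(templateId, (k, v) -> Math.max(0, v - 1));
    }

    synchronized int getGlobalCount(String cloudName) {
        return globalCounts.getOrDefault(cloudName, 0);
    }

    synchronized int getPodTemplateCount(String templateId) {
        return podTemplateCounts.getOrDefault(templateId, 0);
    }
}

class LeakDemo {
    public static void main(String[] args) {
        DemoLimits limits = new DemoLimits();
        limits.register("kubernetes", "ephemeral-1");

        // After a restart the template reference is null, so the old path leaks.
        limits.unregisterIfTemplatePresent("kubernetes", "ephemeral-1", null);
        System.out.println("after old cleanup: " + limits.getGlobalCount("kubernetes"));

        // The ID-based path releases the slot regardless.
        limits.unregisterByIds("kubernetes", "ephemeral-1");
        System.out.println("after id cleanup: " + limits.getGlobalCount("kubernetes"));
    }
}
```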

What changed

  • New method unregisterByIds() in KubernetesProvisioningLimits
  • Updated NodeListener.onDeleted() to use it
  • Added a test case (though it has some issues with the test harness I need help with)

Testing

Ran mvn clean compile - everything builds fine. Checked the logic manually and it makes sense - cloudName and templateId are definitely persistent, so they'll survive restarts.

The test I added compiles but I'm not 100% sure it's set up right for the Jenkins test environment. Would appreciate a look at that.

FYI

I'm also working on PR #2785 which fixes a different counter leak issue (ImagePullBackOff detection). Both are part of debugging issue #2737. Happy to wait for that one to merge first if you prefer - just wanted to get this out there since I had it working.

Thanks for taking a look!

cc @Vlatombe @jglick - would appreciate your review when you have time!

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Fixes jenkinsci#2783
Partially addresses jenkinsci#2737

The problem: When you create pod templates in pipelines (ephemeral templates),
their reference becomes null after Jenkins restarts. The cleanup code was
checking 'if (template != null)' before decrementing counters, so it never
ran - causing a permanent leak.

The fix: Use cloudName and templateId instead of the template object. These
fields survive restarts, so counters get cleaned up correctly.

Abhijeet212004 force-pushed the fix/limit-counter-leak-2783 branch from b699483 to 97ad743 on January 6, 2026 11:58

Prevents attempting to decrement counters for nodes that were
never registered (e.g., in eviction scenarios). Checks if counters
exist before decrementing to avoid spurious warnings.

apuig left a comment

Thank you so much for your work on this. This PR seems to be going in the right direction.

int currentGlobalCount = getGlobalCount(cloudName);
int currentPodTemplateCount = getPodTemplateCount(podTemplateId);

if (currentGlobalCount > 0 || currentPodTemplateCount > 0) {

Please consider decrementing each counter independently, e.g. for a different template on the same cloud.

Author

You're right - we should check each counter independently.
I'll update this to:

  • Check and decrement global count if > 0
  • Check and decrement pod template count if > 0 separately

Will push the fix shortly.
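
For reference, the two separate checks could look roughly like the sketch below. The variable and getter names mirror the snippet under review, but `IndependentCounters` and `register` are hypothetical and the surrounding class is not the plugin's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; only the shape of unregisterByIds follows the review thread.
class IndependentCounters {
    private final Map<String, Integer> globalCounts = new HashMap<>();
    private final Map<String, Integer> podTemplateCounts = new HashMap<>();

    synchronized void register(String cloudName, String podTemplateId) {
        globalCounts.merge(cloudName, 1, Integer::sum);
        podTemplateCounts.merge(podTemplateId, 1, Integer::sum);
    }

    synchronized void unregisterByIds(String cloudName, String podTemplateId) {
        // Each counter is guarded by its own check, so neither can go negative
        // and one being zero does not block the other from being decremented.
        int currentGlobalCount = getGlobalCount(cloudName);
        if (currentGlobalCount > 0) {
            globalCounts.put(cloudName, currentGlobalCount - 1);
        }
        int currentPodTemplateCount = getPodTemplateCount(podTemplateId);
        if (currentPodTemplateCount > 0) {
            podTemplateCounts.put(podTemplateId, currentPodTemplateCount - 1);
        }
    }

    synchronized int getGlobalCount(String cloudName) {
        return globalCounts.getOrDefault(cloudName, 0);
    }

    synchronized int getPodTemplateCount(String podTemplateId) {
        return podTemplateCounts.getOrDefault(podTemplateId, 0);
    }
}

class IndependentCountersDemo {
    public static void main(String[] args) {
        IndependentCounters limits = new IndependentCounters();
        // Two different templates on the same cloud.
        limits.register("kubernetes", "template-a");
        limits.register("kubernetes", "template-b");

        limits.unregisterByIds("kubernetes", "template-a");
        System.out.println("global=" + limits.getGlobalCount("kubernetes")
                + " a=" + limits.getPodTemplateCount("template-a")
                + " b=" + limits.getPodTemplateCount("template-b"));
    }
}
```

In this sketch, removing a node for template-a leaves template-b's counter untouched while the shared cloud counter drops by one.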

// Add the slave to Jenkins
j.jenkins.addNode(slave);

// Remove the node (simulates agent deletion)

Consider calling cloud.removeTemplate(ephemeralTemplate) to simulate the behavior with the idle config after build completion (?). Perhaps a RestartableJenkinsRule could better represent the original issue report.

Author

Thanks for the suggestion! I'll look into using RestartableJenkinsRule
for a better test. Would you prefer I do that in this PR or as a
follow-up improvement?


Calling cloud.removeTemplate seems sufficient to me. I typically begin by writing the test before applying the fix, so if I recall correctly, this test should now fail without your modifications in src/main.

}
podTemplateCounts.put(podTemplateId, Math.max(0, newPodTemplateCount));
LOGGER.log(
        Level.FINEST, () -> podTemplateId + " template limit: " + Math.max(0, newPodTemplateCount));

Nitpick: the unregister function also logs the current cloud.getContainerCap() to put this info in context.

Author

Good point on the logging consistency. I'll add the containerCap to the
log message to match the other unregister methods.


I don't see changes on that front. Is this expected?

- Restructure unregisterByIds() to check and decrement counters independently
- Improve test to simulate ephemeral template removal with cloud.removeTemplate()
- Keep logging simple for consistency with existing patterns
@Abhijeet212004
Author

Hey @apuig! I've pushed the changes from your feedback (commit 2728ae9):

  • Split the counter checks into separate if blocks so each decrements independently
  • Added cloud.removeTemplate(ephemeralTemplate) to the test to simulate the idleMinutes scenario
  • Kept the logging simple to stay consistent with existing code

All checks are passing. Let me know if anything else needs adjusting - happy to make more changes!

apuig commented Jan 7, 2026

LGTM so far. I just need to test it manually on a running instance, I’ll try to do that before the end of the week.

Development

Successfully merging this pull request may close these issues.

idleMinutes config leads to limit counter leak with ephemeral templates (and restart before idle timeout)

2 participants