Changes from 3 commits
@@ -165,17 +165,58 @@ int getPodTemplateCount(String podTemplate) {
        return podTemplateCounts.getOrDefault(podTemplate, 0);
    }

    /**
     * Unregister executors using persistent field values (cloudName and templateId).
     * This method works correctly even after Jenkins restart when transient template references are null.
     *
     * @param cloudName the kubernetes cloud name
     * @param podTemplateId the pod template ID
     * @param numExecutors the number of executors (pretty much always 1)
     */
    public void unregisterByIds(@NonNull String cloudName, @NonNull String podTemplateId, int numExecutors) {
        if (initInstance()) {
            synchronized (this) {
                // Only unregister if the counts exist (node was actually registered)
                int currentGlobalCount = getGlobalCount(cloudName);
                int currentPodTemplateCount = getPodTemplateCount(podTemplateId);

                if (currentGlobalCount > 0 || currentPodTemplateCount > 0) {

Reviewer

Please consider decrementing each counter independently, e.g. to handle the case of a different template on the same cloud.

Author

You're right - we should check each counter independently.
I'll update this to:

  • Check and decrement global count if > 0
  • Check and decrement pod template count if > 0 separately

Will push the fix shortly.
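
A minimal sketch of the independent decrement described above, using plain maps rather than the plugin's classes. The class name `IndependentCounters`, the helper `decrementIfPositive`, and the cloud and template names are hypothetical and only illustrate the idea; this is not the code that was pushed to the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the limits bookkeeping; not the plugin's actual class. */
final class IndependentCounters {
    private final Map<String, Integer> cloudCounts = new HashMap<>();
    private final Map<String, Integer> podTemplateCounts = new HashMap<>();

    /** Increments both counters, mirroring what a registration would do. */
    synchronized void register(String cloudName, String podTemplateId, int numExecutors) {
        cloudCounts.merge(cloudName, numExecutors, Integer::sum);
        podTemplateCounts.merge(podTemplateId, numExecutors, Integer::sum);
    }

    /** Checks and decrements each counter on its own, instead of behind one combined condition. */
    synchronized void unregisterByIds(String cloudName, String podTemplateId, int numExecutors) {
        decrementIfPositive(cloudCounts, cloudName, numExecutors);
        decrementIfPositive(podTemplateCounts, podTemplateId, numExecutors);
    }

    /** Decrements the counter for key only if it is currently positive; never stores a negative value. */
    private static void decrementIfPositive(Map<String, Integer> counts, String key, int amount) {
        int current = counts.getOrDefault(key, 0);
        if (current > 0) {
            counts.put(key, Math.max(0, current - amount));
        }
    }

    synchronized int getGlobalCount(String cloudName) {
        return cloudCounts.getOrDefault(cloudName, 0);
    }

    synchronized int getPodTemplateCount(String podTemplateId) {
        return podTemplateCounts.getOrDefault(podTemplateId, 0);
    }

    public static void main(String[] args) {
        IndependentCounters limits = new IndependentCounters();
        // Two different templates on the same cloud.
        limits.register("cloud-a", "template-1", 1);
        limits.register("cloud-a", "template-2", 1);

        // Removing one agent touches the cloud counter and only its own template counter.
        limits.unregisterByIds("cloud-a", "template-1", 1);

        System.out.println("cloud-a: " + limits.getGlobalCount("cloud-a"));
        System.out.println("template-1: " + limits.getPodTemplateCount("template-1"));
        System.out.println("template-2: " + limits.getPodTemplateCount("template-2"));
    }
}
```

With this shape, a template counter that is already at zero is left alone while the cloud counter is still decremented, so neither check depends on the other.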

                    int newGlobalCount = currentGlobalCount - numExecutors;
                    if (newGlobalCount < 0) {
                        LOGGER.log(
                                Level.WARNING,
                                "Global count for " + cloudName
                                        + " went below zero. There is likely a bug in kubernetes-plugin");
                    }
                    cloudCounts.put(cloudName, Math.max(0, newGlobalCount));
                    LOGGER.log(Level.FINEST, () -> cloudName + " global limit: " + Math.max(0, newGlobalCount));

                    int newPodTemplateCount = currentPodTemplateCount - numExecutors;
                    if (newPodTemplateCount < 0) {
                        LOGGER.log(
                                Level.WARNING,
                                "Pod template count for " + podTemplateId
                                        + " went below zero. There is likely a bug in kubernetes-plugin");
                    }
                    podTemplateCounts.put(podTemplateId, Math.max(0, newPodTemplateCount));
                    LOGGER.log(
                            Level.FINEST, () -> podTemplateId + " template limit: " + Math.max(0, newPodTemplateCount));

Reviewer

Nitpick: the unregister function also logs the current cloud.getContainerCap() to put this info in context.

Author

Good point on the logging consistency. I'll add the containerCap to the
log message to match the other unregister methods.

Reviewer

I don't see changes on that front. Is this expected?

                }
            }
        }
    }

    @Extension
    public static class NodeListenerImpl extends NodeListener {
        @Override
        protected void onDeleted(@NonNull Node node) {
            if (node instanceof KubernetesSlave) {
                KubernetesProvisioningLimits instance = KubernetesProvisioningLimits.get();
                KubernetesSlave kubernetesNode = (KubernetesSlave) node;
                PodTemplate template = kubernetesNode.getTemplateOrNull();
                if (template != null) {
                    instance.unregister(kubernetesNode.getKubernetesCloud(), template, node.getNumExecutors());
                }
                // Use persistent fields (cloudName, templateId) instead of transient template object
                // This works correctly even after Jenkins restart when template reference is null
                instance.unregisterByIds(
                        kubernetesNode.getCloudName(), kubernetesNode.getTemplateId(), node.getNumExecutors());
            }
        }
    }
@@ -91,4 +91,48 @@ public void lotsOfCloudsAndTemplates() throws InterruptedException {
            }
        }
    }

    @Test
    public void testCounterDecrementAfterRestartWithEphemeralTemplate() throws Exception {
        // Create a cloud with an ephemeral template
        KubernetesCloud cloud = new KubernetesCloud("test-cloud");
        cloud.setContainerCap(10);
        j.jenkins.clouds.add(cloud);

        // Create an ephemeral pod template (not saved to cloud config)
        PodTemplate ephemeralTemplate = new PodTemplate();
        ephemeralTemplate.setName("ephemeral-template");
        ephemeralTemplate.setInstanceCap(5);

        // Register the template (simulates agent creation)
        KubernetesProvisioningLimits limits = KubernetesProvisioningLimits.get();
        assertTrue("Should successfully register template", limits.register(cloud, ephemeralTemplate, 1));

        // Get the template ID that was auto-generated
        String templateId = ephemeralTemplate.getId();

        // Verify counters were incremented after registration
        assertEquals("Global count should be 1 after registration", 1, limits.getGlobalCount("test-cloud"));
        assertEquals("Template count should be 1 after registration", 1, limits.getPodTemplateCount(templateId));

        // Create a KubernetesSlave using Builder pattern
        KubernetesSlave slave = new KubernetesSlave.Builder()
                .podTemplate(ephemeralTemplate)
                .cloud(cloud)
                .nodeDescription("Test agent for counter leak fix")
                .build();

        // Add the slave to Jenkins
        j.jenkins.addNode(slave);

        // Remove the node (simulates agent deletion)

Reviewer

Call cloud.removeTemplate(ephemeralTemplate) to simulate the behavior with idle config after build completion (?). Perhaps a RestartableJenkinsRule could better represent the original issue report.

Author

Thanks for the suggestion! I'll look into using RestartableJenkinsRule
for a better test. Would you prefer I do that in this PR or as a
follow-up improvement?

Reviewer

Calling cloud.removeTemplate seems sufficient to me. I typically begin by writing the test before applying the fix, so if I recall correctly, this test should now fail without your modifications in src/main.

        // The fix ensures counters are decremented using cloudName and templateId,
        // even when template reference is null (as happens with ephemeral templates after restart)
        j.jenkins.removeNode(slave);

        // After deletion, counters should be decremented back to 0
        // This is the bug fix: should work even when template reference is null
        assertEquals("Global count should be 0 after node deletion", 0, limits.getGlobalCount("test-cloud"));
        assertEquals("Template count should be 0 after node deletion", 0, limits.getPodTemplateCount(templateId));
    }
}