-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: properly garbage collecting orphaned network interfaces #642
Conversation
Testing this locally. Initially after our first scaleup, we were at 810 network interfaces. After garbage collection ran we can see that we deleted 435 network interfaces, and were left with 375 for our 375 nodes. ~/dev/focus/karpenter-provider-azure (bsoghigian/network-interface-gc*) » cat logs.txt| grep "garbage collected NIC" | wc -l sillygoose@Bryces-MacBook-Pro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realize this is still draft to be cleaned up, so apologies if feedback is too early. Leaving some comments, overall I think I would be tempted to have a separate "networkinterface.garbagecollection" controller - unless there is a good reason not to.
Pull Request Test Coverage Report for Build 12738183194Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go
Outdated
Show resolved
Hide resolved
9eb0362
to
20c6625
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the separate controller! Please take a look at questions around the controller and use of cache first, everything else is minor and/or easy to fix.
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go
Outdated
Show resolved
Hide resolved
if err != nil { | ||
logging.FromContext(ctx).Error(err) | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instance GC controller combines the errors and returns them, here they are only logged an never returned. Why the difference in behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried looking into the original pr to see if we got any rationality behind that: aws/karpenter-provider-aws#3405
Didn't see much.
We shouldn't be requeuing more frequently in the presence of errors. We should probably change that, if VM Deletion fails due to a TooManyRequests error, we can't retry until the quota window resets(After 5 minutes?)
2m might make some sense for vms as the volume of vms that need garbage collection is in most cases significantly lower than the volume of interfaces that need garbage collection.
We could try requeuing depending on the kind of error at different intervals.
If we see TooManyRequests Error wait 5 minutes to retry, if we see IsNicReservedForAnotherVM retry 3 minutes later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it seems simpler to just not do all of that requeuing and come and optimize it later if we are optimizing for performance. For now just requeuing every 5 minutes seems to be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate the explanation. Proper treatment of retries and rate limiting is not trivial, let's not tackle it here. I am Ok with keeping constant retry period here for now.
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go
Outdated
Show resolved
Hide resolved
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go
Outdated
Show resolved
Hide resolved
I'd really love to see this merged, this is causing regular outages for us. |
pkg/controllers/nodeclaim/garbagecollection/instance_garbagecollection.go
Show resolved
Hide resolved
Ci here is failing due to vulncheck: #665 should fix it |
…tor of arg related functions to their own file
…ctoring gc functions slightly
…ate between them to prevent conflicts in nic deletion calls
…ontroller to avoid attempts to delete nics that are managed by a vm
…cleaner state list
…eating nodeclaims with empty required properties
e1c79e5
to
c0e4043
Compare
…chine and test.NetworkInterface
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with suggestions
nic := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name}) | ||
nic2 := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a reason to test two - please capture it in a comment
ExpectApplied(ctx, env.Client, nodeClaim) | ||
|
||
// Create a managed NIC | ||
nic := test.Interface(test.InterfaceOptions{Name: instance.GenerateResourceName(nodeClaim.Name), NodepoolName: nodePool.Name}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I suggested elsewhere to collapse these for readability, but I meant (and included in the example) only test.Interface(test.InterfaceOptions{
part, like this:
nic := test.Interface(test.InterfaceOptions{Name: instance.GenerateResourceName(nodeClaim.Name), NodepoolName: nodePool.Name}) | |
nic := test.Interface(test.InterfaceOptions{ | |
Name: instance.GenerateResourceName(nodeClaim.Name), | |
NodepoolName: nodePool.Name, | |
}) |
This is the pattern that we use in many other places for readability. The rationale is that the type names are usually boilerplate, while the field values are usually of interest.
Expect(len(nicsAfterGC)).To(Equal(1)) | ||
|
||
}) | ||
It("the vm gc controller should remove the nic if there is an associated vm", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to read the code to understand what it does; maybe belongs under VM GC test section? Or just a better description:
It("the vm gc controller should remove the nic if there is an associated vm", func() { | |
It("the VM GC controller should also delete the associated NIC", func() { |
var ( | ||
vmListQuery string | ||
nicListQuery string | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a leftover from VM GC implementation, but not clear to me why these need to be global rather than DefaultProvider fields
}) | ||
It("should only list nics that belong to karpenter", func() { | ||
managedNic := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name}) | ||
unmanagedNic := test.Interface(test.InterfaceOptions{Tags: map[string]*string{"kubernetes.io/cluster/test-cluster": lo.ToPtr("random-aks-vm")}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about the degree of collapsing as elsewhere ...
// VirtualMachineOptions customizes an Azure Virtual Machine for testing. | ||
type VirtualMachineOptions struct { | ||
Name string | ||
NodepoolName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having VM - even test VM - or interface be internally aware of the Nodepool is a bad smell. I think I see why you are doing this - to handle tags automatically - but that pushes even more higher level logic into the test helpers. Overall, I don't think this is an improvement.
Ok, I see now this was maybe in response to my question
What happens if options.Tags are not passed?
I was just asking about whether it would be tolerated / won't crash. Did not mean to suggest they should be calculated automatically (and definetely not at the expense of having to pass nodepool name). Will be more clear next time.
* feat: adding ListNics to instanceprovider interface alongside a refactor of arg related functions to their own file * feat: adding garbage collection logic for network interfaces and refactoring gc functions slightly * feat: working poc of ARG Queries and nic garbage collection, need to fix tests * fix: tests failing due to ListVM * feat: split nic and vm into their own gc controllers, added shared state between them to prevent conflicts in nic deletion calls * feat: Add DeleteNic option to instance provider * docs(code-comment): adding clarification to unremovableNics * test: instanceprovider.ListNics barebones test and arg fake list nic impl * fix: bootstrap.sh * feat: adding in VM List into the networkinterface.garbagecollection controller to avoid attempts to delete nics that are managed by a vm * refactor: unremovableNics out of the vm gc controller in favor for a cleaner state list * fix:updating references to cache * Update pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go * test: adding composable network interface options to our test utils based on existing karp-core pattern * test: adding test we don't include unmanaged nics * test: adding network garbage collection suite and happy path * test: adding tests for unremovable nics * test: adding coverage that vm controller cleans up nics * refactor: renaming controller * fix: refactor name * refactor: using import directly * ci: checking error for controller * fix: ci * fix: addressing comments * Update pkg/controllers/nodeclaim/garbagecollection/instance_garbagecollection.go * refactor: removing name constant * refactor: moving to test utils * fix: removing GetZoneID * style: improving the readability of the network interface garbage collection tests * revert: removing lo.FromPtr checks for nodeclaim creation to avoid creating nodeclaims with empty required properties * refactor: VirtualmachineProperties needs a default for time created * fix: modifying the tests to be aware of time created to properly simulate vm gc * refactor: added nodepoolName to test.VirtualMachine and test.Interface * refactor: moving vm gc tests to use test.VirtualMachine * refactor: renaming arg files to azureresourcegraph * refactor: using deleteIfNicExists directly in cleanupAzureResources * test: createNicFromQueryResponseData missing id, missing name and happy case * refactor: using fake.Region for the default region for test.VirtualMachine and test.NetworkInterface * fix: using input.InterfaceName rather than the outer scope interface name * test: modifying fake to only initialize the query once
Fixes #92
Description
This introduces a garbage collection controller that will remove network interfaces that are orphaned.
How was this change tested?
Does this change impact docs?
Release Note