Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: properly garbage collecting orphaned network interfaces #642

Merged
merged 40 commits into from
Jan 30, 2025

Conversation

Bryce-Soghigian
Copy link
Collaborator

@Bryce-Soghigian Bryce-Soghigian commented Jan 8, 2025

Fixes #92

Description
This introduces a garbage collection controller that will remove network interfaces that are orphaned.

How was this change tested?

  • make test
  • make az-all + scale up

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Properly garbage collect network interfaces.

@Bryce-Soghigian
Copy link
Collaborator Author

Testing this locally.

Initially after our first scaleup, we were at 810 network interfaces.

After garbage collection ran we can see that we deleted 435 network interfaces, and were left with 375 for our 375 nodes.

~/dev/focus/karpenter-provider-azure (bsoghigian/network-interface-gc*) » cat logs.txt| grep "garbage collected NIC" | wc -l sillygoose@Bryces-MacBook-Pro
435

image

Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realize this is still draft to be cleaned up, so apologies if feedback is too early. Leaving some comments, overall I think I would be tempted to have a separate "networkinterface.garbagecollection" controller - unless there is a good reason not to.

pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
pkg/controllers/nodeclaim/garbagecollection/controller.go Outdated Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Jan 10, 2025

Pull Request Test Coverage Report for Build 12738183194

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 100 of 190 (52.63%) changed or added relevant lines in 8 files are covered.
  • 21 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 95.275%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/nodeclaim/garbagecollection/instance_garbagecollection.go 4 5 80.0%
pkg/providers/instance/instance.go 23 28 82.14%
pkg/providers/instance/arglist.go 42 58 72.41%
pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go 0 68 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/providers/instance/armutils.go 1 65.85%
pkg/providers/instance/instance.go 20 83.87%
Totals Coverage Status
Change from base Build 12661198057: -0.1%
Covered Lines: 48191
Relevant Lines: 50581

💛 - Coveralls

@tallaxes tallaxes added area/networking Issues or PRs related to networking area/provisioning Issues or PRs related to provisioning (instance provider) area/error-handling Issues or PRs related to handling of errors labels Jan 14, 2025
@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review January 15, 2025 23:52
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/network-interface-gc branch from 9eb0362 to 20c6625 Compare January 16, 2025 10:40
Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the separate controller! Please take a look at questions around the controller and use of cache first, everything else is minor and/or easy to fix.

pkg/cloudprovider/cloudprovider.go Outdated Show resolved Hide resolved
pkg/test/networkinterfaces.go Outdated Show resolved Hide resolved
pkg/test/virtualmachines.go Outdated Show resolved Hide resolved
pkg/test/networkinterfaces.go Outdated Show resolved Hide resolved
pkg/fake/azureresourcegraphapi.go Outdated Show resolved Hide resolved
Comment on lines +128 to +97
if err != nil {
logging.FromContext(ctx).Error(err)
return
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instance GC controller combines the errors and returns them, here they are only logged an never returned. Why the difference in behaviour?

Copy link
Collaborator Author

@Bryce-Soghigian Bryce-Soghigian Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried looking into the original pr to see if we got any rationality behind that: aws/karpenter-provider-aws#3405

Didn't see much.

We shouldn't be requeuing more frequently in the presence of errors. We should probably change that, if VM Deletion fails due to a TooManyRequests error, we can't retry until the quota window resets(After 5 minutes?)

2m might make some sense for vms as the volume of vms that need garbage collection is in most cases significantly lower than the volume of interfaces that need garbage collection.

We could try requeuing depending on the kind of error at different intervals.

If we see TooManyRequests Error wait 5 minutes to retry, if we see IsNicReservedForAnotherVM retry 3 minutes later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it seems simpler to just not do all of that requeuing and come and optimize it later if we are optimizing for performance. For now just requeuing every 5 minutes seems to be fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the explanation. Proper treatment of retries and rate limiting is not trivial, let's not tackle it here. I am Ok with keeping constant retry period here for now.

@johnthompson-ybor
Copy link

I'd really love to see this merged, this is causing regular outages for us.

@Bryce-Soghigian
Copy link
Collaborator Author

Ci here is failing due to vulncheck: #665 should fix it

@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/network-interface-gc branch from e1c79e5 to c0e4043 Compare January 28, 2025 02:37
@Bryce-Soghigian
Copy link
Collaborator Author

Marked resolved with no comment or action. You feel strongly about keeping this? Or plan to address later?
I removed the uses but seem to have forgotten to remove the error itself. Will get that in.

Copy link
Collaborator

@tallaxes tallaxes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with suggestions

pkg/test/networkinterfaces.go Outdated Show resolved Hide resolved
Comment on lines +343 to +344
nic := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name})
nic2 := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a reason to test two - please capture it in a comment

ExpectApplied(ctx, env.Client, nodeClaim)

// Create a managed NIC
nic := test.Interface(test.InterfaceOptions{Name: instance.GenerateResourceName(nodeClaim.Name), NodepoolName: nodePool.Name})
Copy link
Collaborator

@tallaxes tallaxes Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I suggested elsewhere to collapse these for readability, but I meant (and included in the example) only test.Interface(test.InterfaceOptions{ part, like this:

Suggested change
nic := test.Interface(test.InterfaceOptions{Name: instance.GenerateResourceName(nodeClaim.Name), NodepoolName: nodePool.Name})
nic := test.Interface(test.InterfaceOptions{
Name: instance.GenerateResourceName(nodeClaim.Name),
NodepoolName: nodePool.Name,
})

This is the pattern that we use in many other places for readability. The rationale is that the type names are usually boilerplate, while the field values are usually of interest.

Expect(len(nicsAfterGC)).To(Equal(1))

})
It("the vm gc controller should remove the nic if there is an associated vm", func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to read the code to understand what it does; maybe belongs under VM GC test section? Or just a better description:

Suggested change
It("the vm gc controller should remove the nic if there is an associated vm", func() {
It("the VM GC controller should also delete the associated NIC", func() {

Comment on lines +56 to +59
var (
vmListQuery string
nicListQuery string
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a leftover from VM GC implementation, but not clear to me why these need to be global rather than DefaultProvider fields

})
It("should only list nics that belong to karpenter", func() {
managedNic := test.Interface(test.InterfaceOptions{NodepoolName: nodePool.Name})
unmanagedNic := test.Interface(test.InterfaceOptions{Tags: map[string]*string{"kubernetes.io/cluster/test-cluster": lo.ToPtr("random-aks-vm")}})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the degree of collapsing as elsewhere ...

// VirtualMachineOptions customizes an Azure Virtual Machine for testing.
type VirtualMachineOptions struct {
Name string
NodepoolName string
Copy link
Collaborator

@tallaxes tallaxes Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having VM - even test VM - or interface be internally aware of the Nodepool is a bad smell. I think I see why you are doing this - to handle tags automatically - but that pushes even more higher level logic into the test helpers. Overall, I don't think this is an improvement.

Ok, I see now this was maybe in response to my question

What happens if options.Tags are not passed?

I was just asking about whether it would be tolerated / won't crash. Did not mean to suggest they should be calculated automatically (and definetely not at the expense of having to pass nodepool name). Will be more clear next time.

@Bryce-Soghigian Bryce-Soghigian merged commit 7563bca into main Jan 30, 2025
15 of 16 checks passed
@Bryce-Soghigian Bryce-Soghigian deleted the bsoghigian/network-interface-gc branch January 30, 2025 01:45
Bryce-Soghigian added a commit that referenced this pull request Jan 30, 2025
* feat: adding ListNics to instanceprovider interface alongside a refactor of arg related functions to their own file

* feat: adding garbage collection logic for network interfaces and refactoring gc functions slightly

* feat: working poc of ARG Queries and nic garbage collection, need to fix tests

* fix: tests failing due to ListVM

* feat: split nic and vm into their own gc controllers, added shared state between them to prevent conflicts in nic deletion calls

* feat: Add DeleteNic option to instance provider

* docs(code-comment): adding clarification to unremovableNics

* test: instanceprovider.ListNics barebones test and arg fake list nic impl

* fix: bootstrap.sh

* feat: adding in VM List into the networkinterface.garbagecollection controller to avoid attempts to delete nics that are managed by a vm

* refactor: unremovableNics out of the vm gc controller in favor for a cleaner state list

* fix:updating references to cache

* Update pkg/controllers/nodeclaim/garbagecollection/nic_garbagecollection.go

* test: adding composable network interface options to our test utils based on existing karp-core pattern

* test: adding test we don't include unmanaged nics

* test: adding network garbage collection suite and happy path

* test: adding tests for unremovable nics

* test: adding coverage that vm controller cleans up nics

* refactor: renaming controller

* fix: refactor name

* refactor: using import directly

* ci: checking error for controller

* fix: ci

* fix: addressing comments

* Update pkg/controllers/nodeclaim/garbagecollection/instance_garbagecollection.go

* refactor: removing name constant

* refactor: moving to test utils

* fix: removing GetZoneID

* style: improving the readability of the network interface garbage collection tests

* revert: removing lo.FromPtr checks for nodeclaim creation to avoid creating nodeclaims with empty required properties

* refactor: VirtualmachineProperties needs a default for time created

* fix: modifying the tests to be aware of time created to properly simulate vm gc

* refactor: added nodepoolName to test.VirtualMachine and test.Interface

* refactor: moving vm gc tests to use test.VirtualMachine

* refactor: renaming arg files to azureresourcegraph

* refactor: using deleteIfNicExists directly in cleanupAzureResources

* test: createNicFromQueryResponseData missing id, missing name and happy case

* refactor: using fake.Region for the default region for test.VirtualMachine and test.NetworkInterface

* fix: using input.InterfaceName rather than the outer scope interface name

* test: modifying fake to only initialize the query once
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/error-handling Issues or PRs related to handling of errors area/networking Issues or PRs related to networking area/provisioning Issues or PRs related to provisioning (instance provider)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NicReservedForAnotherVM Error leaves nics in MC Resource Group
5 participants