Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standalone Egress NAT #299

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

p-strusiewiczsurmacki-mobica

This PR introduces Standalone Egress NAT as discussed in #274 .

  • coil-controller is now divided into coil-ipam-controller and coil-egress-controller.
  • coild has now configuration flags to disable/enable egress and/or IPAM features.
  • coil has now capabilities fields that can be used to disable/enable IPAM/Egress.
  • If IPAM is disabled, coild can be run as secondary CNI using CNI chaining as described in setup.md.
  • In case IPAM is disabled, veth aliases will now use pod's UUID instead of container's ID (couldn't get e2e test working using container's ID, I believe container ID is changed during pod restart, but as IPAM is disabled it is not updated as required).

PR was tested using egress related E2E tests with both Kindnet and Calico and tests that are provided in the repository passed.

@terassyi
Copy link
Contributor

@p-strusiewiczsurmacki-mobica
I'm reviewing this, so please wait a little longer.

v2/pkg/cnirpc/cni.proto Outdated Show resolved Hide resolved
v2/config/pod/coil-egress-controller.yaml Show resolved Hide resolved
v2/cmd/coil-egress-controller/sub/root.go Outdated Show resolved Hide resolved
v2/runners/coild_server.go Outdated Show resolved Hide resolved
v2/runners/coild_server.go Outdated Show resolved Hide resolved
v2/runners/coild_server.go Show resolved Hide resolved
v2/runners/coild_server_test.go Outdated Show resolved Hide resolved
v2/pkg/nodenet/pod.go Outdated Show resolved Hide resolved
v2/controllers/egress_watcher.go Outdated Show resolved Hide resolved
v2/pkg/nodenet/pod.go Show resolved Hide resolved
@terassyi
Copy link
Contributor

PR was tested using egress related E2E tests with both Kindnet and Calico and tests that are provided in the repository passed.

Thank you for checking it.
I also ran stand-alone egress mode with Kindnet, and it works well.

It's better if we can run e2e tests for stand-alone egress mode.
What do you think?

@terassyi
Copy link
Contributor

Hi, @p-strusiewiczsurmacki-mobica !

Thank you for waiting.

The main changes look good to me.
So I left some nit comments.
Please check it.

@p-strusiewiczsurmacki-mobica p-strusiewiczsurmacki-mobica force-pushed the standalone-egress branch 2 times, most recently from 66abdf2 to df10c75 Compare September 24, 2024 15:00
@p-strusiewiczsurmacki-mobica
Copy link
Author

PR was tested using egress related E2E tests with both Kindnet and Calico and tests that are provided in the repository passed.

Thank you for checking it. I also ran stand-alone egress mode with Kindnet, and it works well.

It's better if we can run e2e tests for stand-alone egress mode. What do you think?

I agree this would be great. However I need to think about how this test should be done. The problem with kindnet is, that it requires different CNI config on each node, so I would need to add some script that would get node subnet from the kind cluster's nodes after the cluster was created and use that data to update CNI config after make coil-install is executed.

@terassyi
Copy link
Contributor

I would need to add some script that would get node subnet from the kind cluster's nodes after the cluster was created and use that data to update CNI config after make coil-install is executed.

Yes, it isn't easy to configure the cluster to use kindnet and coil.

OK, in this PR, it's OK no e2e test for egress-only mode.
Thanks.

@p-strusiewiczsurmacki-mobica
Copy link
Author

@terassyi I've added egress-only e2e tests to e2e/Makefile and instructions on how to use those to docs/setup.md.

@terassyi terassyi self-requested a review October 4, 2024 01:45
v2/e2e/Makefile Outdated Show resolved Hide resolved
v2/e2e/coil_test.go Outdated Show resolved Hide resolved
docs/setup.md Outdated Show resolved Hide resolved
@p-strusiewiczsurmacki-mobica p-strusiewiczsurmacki-mobica force-pushed the standalone-egress branch 2 times, most recently from 59227cc to 2eb29ee Compare October 10, 2024 16:06
v2/e2e/coil_test.go Outdated Show resolved Hide resolved
v2/e2e/Makefile Outdated Show resolved Hide resolved
v2/e2e/coil_test.go Outdated Show resolved Hide resolved
@p-strusiewiczsurmacki-mobica p-strusiewiczsurmacki-mobica force-pushed the standalone-egress branch 2 times, most recently from 2c074c1 to 8958f4b Compare October 17, 2024 14:55
@p-strusiewiczsurmacki-mobica
Copy link
Author

@terassyi Just one important thing - it turned out that egress-controller pods has to use hostNetwork as, otherwise, it seems to be unable to reach the apiserver if Coild is used as primary CNI.
So I've added the hostNetwork to the manifest once again.

@terassyi
Copy link
Contributor

@p-strusiewiczsurmacki-mobica
Thank you for reporting.

So I've added the hostNetwork to the manifest once again.

In this PR, it's ok to turn on hostNetwork.

When I tried to turn out hostNetwork, I found that egess-controller couldn't communicate with the api-server.
I'm not sure why it can't reach api-server.

But, I think hostNetwork is an unnecessary setting.
So, I will create an investigation issue after merging this.

@terassyi
Copy link
Contributor

https://github.com/cybozu-go/coil/pull/299/files#diff-87c0c48de880a4de757f70cabed8e310b213adaebd0974c60386af47de4aeb8bR81

Could you update the manifests target to be able to select generating manifests for only ipam or egress related or both?

@p-strusiewiczsurmacki-mobica
Copy link
Author

https://github.com/cybozu-go/coil/pull/299/files#diff-87c0c48de880a4de757f70cabed8e310b213adaebd0974c60386af47de4aeb8bR81

Could you update the manifests target to be able to select generating manifests for only ipam or egress related or both?

@terassyi Done. :)

manifests, manifests-egress and manifests-ipam are now available to be used.

It seems that controller-gen only accepts directories for manifest generation, so I had to add some workaround for that (copying *_webhook.go files to temporary folder) but I think it should work OK.

@terassyi
Copy link
Contributor

When trying to run e2e test, it fails with following error.

Error: accumulating resources: accumulation err='accumulating resources from '../../../../config/default/egress/v4': read /home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/egress/v4: is a directory': recursed accumulation of path '/home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/egress/v4': trouble configuring builtin PatchStrategicMergeTransformer with config: `
paths:
- ../../webhook_manifests_patch.yaml
`: open /home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/webhook_manifests_patch.yaml: no such file or directory
error: no objects passed to apply

I ran following commands.

$ make -C .. manifests
$ WITH_KINDNET=true TEST_IPV6=false make start
$ make install-coil-egress-v4

@p-strusiewiczsurmacki-mobica
Copy link
Author

@terassyi

Error: accumulating resources: accumulation err='accumulating resources from '../../../../config/default/egress/v4': read /home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/egress/v4: is a directory': recursed accumulation of path '/home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/egress/v4': trouble configuring builtin PatchStrategicMergeTransformer with config: `
paths:
- ../../webhook_manifests_patch.yaml
`: open /home/terashima/go/src/github.com/cybozu-go/coil/v2/config/default/webhook_manifests_patch.yaml: no such file or directory
error: no objects passed to apply

I believe you're missing a file v2/config/default/webhook_manifests_patch.yaml which is generated by make certs target (this is something i did not change).

So, you'd have to run

make -C .. manifests && make -C .. certs
WITH_KINDNET=true TEST_IPV6=false make start
make install-coil-egress-v4

@terassyi
Copy link
Contributor

@p-strusiewiczsurmacki-mobica

Thanks!
I could run e2e tests for each configuration, and it seems all e2e tests work well.

It seems procedures to run e2e tests are getting complex, so cloud you update e2e/README.md?

@terassyi
Copy link
Contributor

It seems webhook-related tests in small test fail.
Could you fix this test?

@terassyi
Copy link
Contributor

Lastly, I want you to update CI to run e2e tests for all combinations, such as egress-only and egress-only, with ipv6 and egress+ipam, etc.

When you want to run CI, please mention me.
I respond as quickly as I can.

@p-strusiewiczsurmacki-mobica
Copy link
Author

@terassyi Small test should be fixed and actions were added to the CI.

Should I squash the commits before it'll be merged?

@terassyi
Copy link
Contributor

terassyi commented Nov 1, 2024

Thanks!

It passes all CIs.

Should I squash the commits before it'll be merged?

Yes, please.

@p-strusiewiczsurmacki-mobica
Copy link
Author

@terassyi It's squashed now :)

@terassyi
Copy link
Contributor

terassyi commented Nov 5, 2024

@p-strusiewiczsurmacki-mobica

Thanks!

I found small test seems to be flaky.
Cloud you fix this?

https://github.com/cybozu-go/coil/actions/runs/11663759016

Signed-off-by: Patryk Strusiewicz-Surmacki <[email protected]>

Update v2/cmd/coil-egress-controller/sub/root.go
Update v2/runners/coild_server.go
Update v2/pkg/nodenet/pod.go
Update v2/pkg/cnirpc/cni.proto

Co-authored-by: Tomoya Terashima <[email protected]>
@p-strusiewiczsurmacki-mobica
Copy link
Author

p-strusiewiczsurmacki-mobica commented Nov 6, 2024

@terassyi I've fixed the warnings and added some fixes for the CI jobs.
One last thing that is bugging me are the controller-runtime's/go-client's cache errors in the small tests, like this one:

W1105 08:58:09.635773   24935 reflector.go:470] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:232: watch of *v1.PodDisruptionBudget ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

But I believe that this might be out of our reach here, as it seems it is somewhat known issue with controller runtime's cache.

I've tried disabling the cache altogether but for some reason it did not work for me. I believe those were also not introduced by my changes as I can see them in other runs of CI, e.g: https://github.com/cybozu-go/coil/actions/runs/10804727159/job/29970647406

If you see anything else I could fix just let me know.

@RefluxMeds
Copy link

Hi @terassyi,
Have you had time to check this? It would be very beneficial for our org.
Thanks!

@terassyi
Copy link
Contributor

@RefluxMeds

Hi, I'm checking this to be able to merge.
Please wait a little longer!

@terassyi
Copy link
Contributor

terassyi commented Nov 22, 2024

@p-strusiewiczsurmacki-mobica

I re-reviewed all changes again, and I want to discuss the need for enable-spam/egress flags in the coil CNI plugin.

I think it's enough to handle these flags in coild.

Now, we check the features we want to use by values given by the coil CNI plugin.
However, could also has the same flags.
I think we can use coild's enable-ipam/egress flags instead of coil CNI's flags.

https://github.com/cybozu-go/coil/pull/299/files#diff-f7d3c7316366cfd19ef46b412e3ab6937dc7d6607b49a73b613e3fecd639d875R181

The only concern I have now is following the code.

https://github.com/cybozu-go/coil/pull/299/files#diff-e193be60a714369d09504366326bc0f86dff95f1dabd9e1bd16e027dac81acb2R33

But we can solve this by moving this error handling to coild.

If we can keep the CNI configuration simple and no changes, it's better for all users.

…ress-settings-fix

Standalone egress settings fix
@p-strusiewiczsurmacki-mobica
Copy link
Author

p-strusiewiczsurmacki-mobica commented Nov 25, 2024

@terassyi
OK, I've removed the capabilities settings from the CNI binary itself. The configuration will be now handled by coild flags. I've also moved the check you've mentioned to coild and changed small tests a bit to reflect that.

"missing pod name/namespace", fmt.Sprintf("%+v", args.Args))
isChained, err := getSettings(args)
if err != nil {
return nil, newInternalError(fmt.Errorf("runtime error"), "failed to get CNi arguments")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, newInternalError(fmt.Errorf("runtime error"), "failed to get CNi arguments")
return nil, newInternalError(fmt.Errorf("runtime error"), "failed to get CNI arguments")

@@ -88,7 +92,8 @@ func (n natSetup) Hook(l []GWNets, log *zap.Logger) func(ipv4, ipv6 net.IP) erro
}

// NewCoildServer returns an implementation of cnirpc.CNIServer for coild.
func NewCoildServer(l net.Listener, mgr manager.Manager, nodeIPAM ipam.NodeIPAM, podNet nodenet.PodNetwork, setup NATSetup, logger *zap.Logger) manager.Runnable {
func NewCoildServer(l net.Listener, mgr manager.Manager, nodeIPAM ipam.NodeIPAM, podNet nodenet.PodNetwork, setup NATSetup, cfg *config.Config, logger *zap.Logger,
aliasFunc func(interfaces map[string]bool, conf *nodenet.PodNetConf, logger *zap.Logger, pod *corev1.Pod) error) manager.Runnable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add the callback function to add the interface alias?

if err != nil {
logger.Sugar().Errorw("failed to allocate address", "error", err)
return nil, newInternalError(err, "failed to allocate address")
if !s.cfg.EnableIPAM && !s.cfg.EnableEgress {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is enough to check only when starting the cold server.

if err != nil {
return fmt.Errorf("netlink: failed to look up the host-side veth [%s]: %w", ifName, err)
}
logger.Sugar().Infof("link found: %v", hLink)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this log.

@terassyi terassyi self-requested a review December 2, 2024 08:56
@terassyi
Copy link
Contributor

terassyi commented Dec 2, 2024

@p-strusiewiczsurmacki-mobica
Thank you for fixing!

I added some comments :)
Please check it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants