Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zebra -K argument and extend e2e testing #231

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

karampok
Copy link
Contributor

@karampok karampok commented Nov 26, 2024

/kind bug

What this PR does / why we need it:
If we use learning routes, when we restart the pod those routes are gone from the k8s node. There is an option for zebra to keep them. That is useful only if Graceful restart is being used.

Special notes for your reviewer:

Release note:

Add by default the -K (graceful_restart) argument in Zebra so that learnt routes are maintained on the host when zebra process restart.

@karampok karampok force-pushed the learning-route-gr branch 2 times, most recently from d60431a to 2a4af86 Compare November 27, 2024 07:52
@karampok karampok force-pushed the learning-route-gr branch 4 times, most recently from 925a212 to 35ba89e Compare December 4, 2024 16:03
@karampok karampok marked this pull request as ready for review December 5, 2024 09:10
@karampok
Copy link
Contributor Author

karampok commented Dec 5, 2024

@fedepaol this is ready for review but probably we need to break into different commits/PR

  1. one PR in metallb for adding a test function
  2. one commit to add the real change
  3. one commit to extent the test

Thanks

@fedepaol
Copy link
Member

fedepaol commented Dec 5, 2024

@fedepaol this is ready for review but probably we need to break into different commits/PR

1. one PR in metallb for adding a test function

2. one commit to add the real change

3. one commit to extent the test

Thanks

Then please break it :-)

@karampok
Copy link
Contributor Author

karampok commented Dec 5, 2024

sure just confirming this is what you prefer

@@ -49,7 +49,7 @@ data:
# Check /etc/pam.d/frr if you intend to use "vtysh"!
#
vtysh_enable=yes
zebra_options=" -A 127.0.0.1 -s 90000000"
zebra_options=" -A 127.0.0.1 -K 120 -s 90000000"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the commit says also --graceful_restart argument but I don't see it.
Also, what's special about 120 seconds? Can we document it (as a comment and in the commit description)?
Also also, would it make sense to have it as a parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -K is the short version arg for the --graceful_restart, I prefered the short, do you prefer to use the long or should improve the commit title?

About the 120seconds, the 120 matches the 120 seconds that graceful-timeout advertise to the other side (https://docs.frrouting.org/en/latest/bgp.html#clicmd-bgp-graceful-restart-restart-time-0-4095)

I do not think it should be parameter because we already hardcode the https://docs.frrouting.org/en/latest/bgp.html#clicmd-bgp-graceful-restart-restart-time-0-4095 and i do not see any benefit.

e2etests/go.mod Outdated
@@ -6,21 +6,21 @@ toolchain go1.22.4

replace (
github.com/metallb/frr-k8s => ../
go.universe.tf/e2etest => github.com/metallb/metallb/e2etest v0.0.0-20240715121012-af3c10d65f18
go.universe.tf/e2etest => github.com/karampok/metallb/e2etest v0.0.0-20240613071516-2d52178fbc89
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this #239 is merged, I would refactor and this PR should not have any changes in the go.mod.

I am drafting the PR until the dependency is merged

e2etests/go.mod Outdated
k8s.io/api v0.31.0
k8s.io/apimachinery v0.31.0
k8s.io/client-go v0.31.0
k8s.io/api v0.31.4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these dep updates relevant?

)
})
})

// checkPrefixOnNodes checks that a prefix has at least one route on the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this belong to some pkg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo this help function is tailored to the graceful restart scenario because it checks the routes on the host itself (all other test use the routes from the vtysh) and is relaxed on the next hop (because not all nexthops exist as decribed int he coment). So I would not transfer into a pkg.

No objection though if you want that (which pkg do you have in mind)?

})

ginkgo.Context("When restarting the frrk8s deamon pods", func() {
ginkgo.DescribeTable("routes are maintained", func(ipFam ipfamily.Family, prefix, learnRoute string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we clarify that we are checking routes both sides? Would be a single test checking each side clearer?

Copy link
Contributor Author

@karampok karampok Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will extend the description.

Would be a single test checking each side clearer?

Not sure do you mean two tests? Given that test takes 2minutes I would not add another instance of it for time reasons. Also checking in two consistently is not working because I must make sure that routes always exists.

@karampok karampok marked this pull request as draft December 12, 2024 12:05
@karampok karampok marked this pull request as ready for review December 16, 2024 08:27
https://docs.frrouting.org/en/latest/zebra.html#cmdoption-zebra-K

With that argument `-K 120`, when zebra stops it does not clean the
routes that has install on the host. When zebra starts up with, it will
wait 120 seconds before removing (sweeping) any routes that it
previously created but are no longer valid.

The value 120 is used to match the default `bgp gracefule-restart
restart-time` that we advertise to the peer which is also 120seconds.
https://docs.frrouting.org/en/latest/bgp.html#clicmd-bgp-graceful-restart-restart-time-0-4095

This is needed to complete the BGP graceful restart feature when the
local endpoints depend on local routes that are learned from the BGP
peer.

Downside is that if we simply remove FRR (delete deployment or change
speaker label toleration) those routes will remain on the host until
reboot.

Signed-off-by: karampok <[email protected]>
Extend the BGP graceful restart test to verify that both advertised and
learned routes are maintained during the restart of the frr-k8s pods.
This test should passed as long as the zebra process in FRR has the -K
argument.

Signed-off-by: karampok <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants