Catch GR test failing #213

Draft
wants to merge 2 commits into
base: main

karampok
Contributor

/kind flake

/kind failing
/kind documentation
/kind regression

What this PR does / why we need it:

Special notes for your reviewer:

Release note:


@karampok karampok force-pushed the flake-bgp-gr-test branch 4 times, most recently from d2464a2 to 31d9289 Compare October 23, 2024 17:30
@karampok
Contributor Author

This error appears

 [FAILED] Failed after 48.868s.
  Unexpected error:
      <*errors.joinError | 0xc0004c3c08>: 
      Neigh ibgp-multi-hop does not have prefix 5.5.5.5/32: IP map[172.18.0.4:{}] found in nodes but not in next hops
      {
          errs: [
              <*fmt.wrapError | 0xc00011a460>{
                  msg: "Neigh ibgp-multi-hop does not have prefix 5.5.5.5/32: IP map[172.18.0.4:{}] found in nodes but not in next hops",
                  err: <*errors.errorString | 0xc000a20100>{
                      s: "IP map[172.18.0.4:{}] found in nodes but not in next hops",
                  },
              },
          ],
      }
  occurred
  In [It] at: /home/runner/work/frr-k8s/frr-k8s/e2etests/tests/graceful_restart.go:147 

which seems to happen because the node filters the prefix:

2024/10/23 13:48:48 BGP: [VERY4-P6JC8] 172.30.0.2(Unknown) [Update:SEND] 5.5.5.5/32 is filtered by route-map '172.30.0.2-out'

So far this has only been observed for ibgp-multi-hop.
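
To check other runs for the same symptom, the FRR debug log can be scanned for that message. A minimal sketch, assuming the log line format is exactly the one pasted above; `findFilteredUpdates` and `filteredUpdate` are hypothetical names, not part of the e2e tests:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// filteredRe matches lines like the one above:
//   ... [Update:SEND] 5.5.5.5/32 is filtered by route-map '172.30.0.2-out'
// The format is assumed from that single log line.
var filteredRe = regexp.MustCompile(`\[Update:SEND\] (\S+) is filtered by route-map '([^']+)'`)

// filteredUpdate is a hypothetical helper type holding one match.
type filteredUpdate struct {
	Prefix   string
	RouteMap string
}

// findFilteredUpdates returns every prefix that the log reports as
// filtered by a route-map while sending updates.
func findFilteredUpdates(log string) []filteredUpdate {
	var out []filteredUpdate
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if m := filteredRe.FindStringSubmatch(sc.Text()); m != nil {
			out = append(out, filteredUpdate{Prefix: m[1], RouteMap: m[2]})
		}
	}
	return out
}

func main() {
	log := "2024/10/23 13:48:48 BGP: [VERY4-P6JC8] 172.30.0.2(Unknown) [Update:SEND] 5.5.5.5/32 is filtered by route-map '172.30.0.2-out'\n"
	for _, f := range findFilteredUpdates(log) {
		fmt.Printf("prefix %s filtered by route-map %s\n", f.Prefix, f.RouteMap)
	}
}
```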

@karampok
Contributor Author

A second failure is observed

[FAILED] Timed out after 60.650s.
  route should exist before we restart frr-k8s
  Unexpected error:
      <*fmt.wrapError | 0xc000286960>: 
      Neigh ibgp-multi-hop does not have prefix 5.5.5.5/32: route not found 5.5.5.5/32
      {
          msg: "Neigh ibgp-multi-hop does not have prefix 5.5.5.5/32: route not found 5.5.5.5/32",
          err: <routes.RouteNotFoundError>{Route: "5.5.5.5/32"},
      }
  occurred
  In [It] at: /home/runner/work/frr-k8s/frr-k8s/e2etests/tests/graceful_restart.go:115 @ 10/23/24 17:21:27.916
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[AfterSuite] 
/home/runner/work/frr-k8s/frr-k8s/e2etests/e2etest_suite_test.go:93
[AfterSuite] PASSED [2.374 seconds]
------------------------------

Summarizing 1 Failure:
  [FAIL] Establish BGP session with EnableGracefulRestart When restarting the frrk8s deamon pods external BGP peer maintains routes [It] IPV4
  /home/runner/work/frr-k8s/frr-k8s/e2etests/tests/graceful_restart.go:115

Ran 1 of 97 Specs in 97.308 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 96 Skipped
--- FAIL: TestE2E (97.34s)
FAIL

Tests failed on attempt #22 

which happens during the setup, before the GR test even starts.

@karampok
Contributor Author

karampok commented Oct 24, 2024

@fedepaol the investigation showed that there is an option

bgp route-map delay-timer X

The default value is 5, and that seems to make the test fail due to timing issues. I cannot find anything about this option in the official documentation.

With a value of 20, CI ran 50 times per test (ipv4, ipv6, dual, helm) without a failure*.

With a value of 0, running the test once locally fails in the same way.

* One test failed because docker exec failed.

FRRouting/frr#10757
FRRouting/frr#10756

@karampok
Contributor Author

karampok commented Oct 25, 2024

  STEP: 	frrk8s pod ARE ready @ 10/25/24 08:34:24.12
• [FAILED] [78.209 seconds]
Establish BGP session with EnableGracefulRestart When restarting the frrk8s deamon pods external BGP peer maintains routes [It] IPV4
/home/runner/work/frr-k8s/frr-k8s/e2etests/tests/graceful_restart.go:150

  [FAILED] Failed after 19.692s.
  Unexpected error:
      <*errors.joinError | 0xc000c7a648>: 
      Neigh ibgp-single-hop does not have prefix 5.5.5.5/32: route not found 5.5.5.5/32
      {
          errs: [
              <*fmt.wrapError | 0xc0007f63c0>{
                  msg: "Neigh ibgp-single-hop does not have prefix 5.5.5.5/32: route not found 5.5.5.5/32",
                  err: <routes.RouteNotFoundError>{Route: "5.5.5.5/32"},
              },
          ],
      }
  occurred
  In [It] at: /home/runner/work/frr-k8s/frr-k8s/e2etests/tests/graceful_restart.go:147

That failure looks different. Looking at the logs:

[I] kka@f-t14s /t/p/IPV4> cat frrdump-ibgp-single-hop.log |grep graceful|grep 172.18.0.2
2024/10/25 08:33:54.061 BGP: [RPZW2-39GTY] 172.18.0.2(frr-k8s-worker2) graceful restart timer started for 120 sec
2024/10/25 08:33:54.061 BGP: [TK2B6-ZF4MR] 172.18.0.2(frr-k8s-worker2) graceful restart stalepath timer started for 360 sec
2024/10/25 08:34:06.911 BGP: [TASMS-1WSKN] 172.18.0.2(frr-k8s-worker2) graceful restart timer stopped
2024/10/25 08:34:06.911 BGP: [P98A2-2RDFE] 172.18.0.2(frr-k8s-worker2) graceful restart stalepath timer stopped

It seems that not all nodes were healthy when GR ended, and the test stopped.

The FRR containers start trying to connect at 08:34:06.


@karampok karampok force-pushed the flake-bgp-gr-test branch 17 times, most recently from 99fd53d to 9daae72 Compare November 21, 2024 08:09
@karampok karampok force-pushed the flake-bgp-gr-test branch 6 times, most recently from e7c6703 to 0481986 Compare November 25, 2024 11:59
@karampok karampok force-pushed the flake-bgp-gr-test branch 2 times, most recently from 63bf420 to 1ffcf0c Compare December 12, 2024 13:22
- debug in watchfrr
- capture coredump on bgp crashes
- increase prefix scale on ipv4

Signed-off-by: karampok <[email protected]>
Signed-off-by: karampok <[email protected]>