[Flaky] When main path ISL is UP, and ISL of protected path becomes active, and other non-involved ISLs have not enough bandwidth, the flow does not become UP, it stays degraded with “protected-path”: “Down” #5655

Open
izadorozhna opened this issue May 7, 2024 · 0 comments

Steps to reproduce with the automated test:

  1. Go to a test spec "Flow swaps to protected path when main path gets broken, becomes DEGRADED if protected path is unable to reroute(no bw)"
  2. Change the code to select the 7 and 8 switches as a pair:
         given: "Two switches with 2 diverse paths at least"
         //def switchPair = switchPairs.all().withAtLeastNNonOverlappingPaths(2).random()
         //https://github.com/telstra/open-kilda/issues/5608
-        def switchesWhere5608IsReproducible = topology.activeSwitches.findAll {it.dpId.toString().endsWith("08")
-        ||it.dpId.toString().endsWith("09")}
+        def switches_7_and_8 = topology.activeSwitches.findAll {it.dpId.toString().endsWith("07")
+                ||it.dpId.toString().endsWith("08")}
         def switchPair = switchPairs.all()
-                .excludeSwitches(switchesWhere5608IsReproducible)
+                .includeSwitch(switches_7_and_8[0])
+                .includeSwitch(switches_7_and_8[1])
                 .withAtLeastNNonOverlappingPaths(2).random()
  3. Execute the test; it will now run with switches 7 and 8.
  4. If the test passes, repeat step 3.
  5. The issue is reproduced when the test fails at the step where the main ISL is restored: the flow is expected to be UP, but it is Degraded.
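
Because the failure is intermittent, the repeat-until-failure loop above can be sketched as a small Groovy helper. This is only an illustration: `firstFailingRun` and the `runOnce` closure are hypothetical names, and how the real spec is actually launched is not shown here. The stand-in closure simply pretends that the third run ends in the Degraded state.

```groovy
// Runs a test closure up to maxRuns times.
// Returns the 1-based number of the first failing run, or -1 if every run passed.
int firstFailingRun(int maxRuns, Closure<Boolean> runOnce) {
    for (int run = 1; run <= maxRuns; run++) {
        if (!runOnce(run)) {
            return run
        }
    }
    return -1
}

// Stand-in for the real test (assumption): runs 1 and 2 pass, run 3 fails.
def failedAt = firstFailingRun(10) { int run -> run != 3 }
println "First failure on run: ${failedAt}"
```

Stopping at the first failure keeps the environment in the failed state, so the flow history can be inspected right away.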

Steps to reproduce manually:

  1. Select switches 7 and 8 and create a flow with a protected path. Usually, such a flow has a path size of 2 (7<-->8) for both the main and protected paths.
  2. Select all ISLs that are not involved in the main or protected path of the flow and decrease their bandwidth to a minimum.
  3. Break the ISL(s) of the main path, so that the original protected path becomes the main path, and the original main path, with its broken ISL, becomes the protected path, which is down.
  4. Check that the flow now has the Degraded status because a protected path cannot be found (the original main path ISL is broken and cannot serve as a new protected path, and the other non-involved ISLs do not have enough bandwidth).
  5. Restore the original main ISL that was broken in step 3.
  6. Check that the flow becomes active with main and protected paths UP.
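
The final check in the manual steps can be sketched as a polling helper. This is a minimal sketch under assumptions: `isFullyUp`, `waitForFlowUp` and the `fetchFlow` closure are hypothetical names, and `fetchFlow` stands in for whatever call returns the flow as a map (for example, a wrapper around a Northbound request). Only the `status` and `status-details` fields quoted in this issue are inspected.

```groovy
// True only when the flow and both of its paths report "Up".
boolean isFullyUp(Map flow) {
    def details = flow?.get("status-details") ?: [:]
    return flow?.status == "Up" &&
            details["main-path"] == "Up" &&
            details["protected-path"] == "Up"
}

// Polls the caller-supplied closure until the flow is fully Up
// or the attempts run out; returns whether the flow became Up.
boolean waitForFlowUp(Closure<Map> fetchFlow, int attempts = 10, long delayMs = 1000) {
    for (int i = 0; i < attempts; i++) {
        if (isFullyUp(fetchFlow())) {
            return true
        }
        Thread.sleep(delayMs)
    }
    return false
}

// Stubbed responses (assumption): Degraded on the first poll, Up afterwards.
def responses = [
    [status: "Degraded", "status-details": ["main-path": "Up", "protected-path": "Down"]],
    [status: "Up", "status-details": ["main-path": "Up", "protected-path": "Up"]]
]
def calls = 0
def becameUp = waitForFlowUp({ responses[Math.min(calls++, responses.size() - 1)] }, 5, 10)
println "Flow became Up: ${becameUp} after ${calls} poll(s)"
```

If the helper returns false after a generous timeout, the flow is still Degraded and the issue is likely reproduced.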

Expected result:

The flow becomes UP:

"status": "Up",
"status-details": {
"main-path": "Up",
"protected-path": "Up"
}

The flow history should contain a reroute action that starts after the ISL becomes Active. The protected path already exists: it was down only because of the broken ISL, and now that this ISL is up again, the same protected path is found, so Kilda skips creating a new protected path:
(screenshot of the flow history)

Actual result:

When the same test is executed several times (with the same switch pair 7-8), the result is not consistent. Sometimes the expected result is received, but sometimes, after the main ISL is restored, the flow stays in the Degraded state with "protected-path": "Down":

"status": "Degraded",
"status-details": {
"main-path": "Up",
"protected-path": "Down"
},
"status_info": "Couldn't find non overlapping protected path",

However, the history does contain the reroute action started after the ISL became active:
(screenshot of the flow history)
For some reason, this time it does not contain the "Found the same protected path. Skipped creating of it" message; it contains "Couldn't find non overlapping protected path. Skipped creating it" instead.

Also, an explicit manual reroute via the Northbound V2 API helps: the flow is rerouted and becomes UP, and the flow history shows a new reroute action started via Northbound. However, the API response to the reroute request contains rerouted: false for some reason:
(screenshot of the reroute API response)

Attaching the flow history JSON, which includes the explicit manual reroute action as well.
07May180118_375_cinnamon9255.json

Attaching topology.yaml:
topology.yaml.log

P.S. Please note that the test case is flaky, so the steps need to be repeated several times to reproduce the issue.
It is also important to note that there is a separate, similar test, "Flow swaps to protected path when main path gets broken, becomes DEGRADED if protected path is unable to reroute(no bw)", which has similar steps, except that the other ISLs (those not involved in the main or protected paths) are broken instead of having their bandwidth decreased. That test also fails sometimes with switch pair 7-8.
