
[active-standby] Fix default route handler race condition #254

Merged
2 commits merged into sonic-net:master on Jun 21, 2024


lolyu (Contributor) commented on Jun 20, 2024

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • New feature
  • Doc/Design
  • Unit test

Approach

What is the motivation for this PR?

Fix a race condition in handling default route notifications.

This is similar to #104

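When linkmgrd receives multiple default route notifications, the mux port posts a default route handler for each one, wrapped with strand::wrap. A wrapped handler is first posted to the io_service and only enters the strand when an io_service thread picks it up; with more than one such thread, the wrappers can be picked up in either order, so Boost.Asio doesn't guarantee that the default route handlers run in the order they were posted. As a result, the state machine's final default route state could be any of the intermediate states instead of the latest one.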

For example, given these two default route notifications:

[2024-06-20 08:28:57.872911] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: na
[2024-06-20 08:28:57.872954] [warning] MuxPort.cpp:365 handleDefaultRouteState: port: EtherTest01, state db default route state: ok

If the handler for "ok" runs after the handler for "na", the final state machine default route state is "ok", which is correct.
If the handler for "ok" runs before the handler for "na", the final state machine default route state is "na", which is stale.
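To make the ordering problem concrete, here is a deliberately simplified, single-threaded C++ model; it is not linkmgrd or Boost.Asio code. `DefaultRoute`, `parseDefaultRoute` and `finalStateViaWrap` are hypothetical names, and the enum values are an assumption. The `pickupOrder` argument stands in for whichever order the io_service threads happen to dequeue the wrapped handlers, which is exactly the part Boost.Asio leaves unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the state machine's DefaultRoute enum (assumption).
enum class DefaultRoute { Wait = 0, NA = 1, OK = 2 };

// Hypothetical helper: map a STATE_DB default route string ("na"/"ok") to the enum.
DefaultRoute parseDefaultRoute(const std::string& s) {
    return s == "ok" ? DefaultRoute::OK : DefaultRoute::NA;
}

// Simplified model of the strand::wrap path: each notification becomes an
// opaque wrapper posted to the shared io_service. A wrapper only enters the
// strand when a worker thread dequeues it, and pickupOrder stands in for the
// (unspecified) order in which the workers happen to do that.
DefaultRoute finalStateViaWrap(const std::vector<std::string>& notifications,
                               const std::vector<std::size_t>& pickupOrder) {
    DefaultRoute state = DefaultRoute::Wait;
    std::vector<std::function<void()>> wrappers;
    for (const auto& n : notifications) {
        const DefaultRoute route = parseDefaultRoute(n);
        wrappers.push_back([&state, route] { state = route; });
    }
    // The strand still runs one handler at a time, but in pickup order,
    // not in notification order.
    for (std::size_t i : pickupOrder) {
        wrappers.at(i)();
    }
    return state;
}
```

With the notifications from the log above, `finalStateViaWrap({"na", "ok"}, {0, 1})` yields `DefaultRoute::OK`, while the equally legal pickup order `{1, 0}` yields the stale `DefaultRoute::NA`.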

Signed-off-by: Longxiang Lyu [email protected]

Work item tracking
  • Microsoft ADO (number only): 28471183

How did you do it?

Post the default route handlers directly through the strand instead of wrapping them with strand::wrap. Handlers posted to a strand run in the order they were posted, so the default route handlers now execute in notification order.
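A matching sketch of the fix, under the same assumptions (hypothetical names and enum values, a single-threaded model rather than real Boost.Asio code). In Boost.Asio terms this corresponds roughly to replacing a post of `strand.wrap(handler)` with posting the handler to the strand itself (`strand.post(handler)`, or `boost::asio::post(strand, handler)` in newer releases). Boost.Asio's strand documentation guarantees that if `s.post(a)` happens before `s.post(b)`, then `a` is invoked before `b`, and it explicitly makes no such guarantee for wrapped handlers.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the state machine's DefaultRoute enum (assumption).
enum class DefaultRoute { Wait = 0, NA = 1, OK = 2 };

// Hypothetical helper: map a STATE_DB default route string to the enum.
DefaultRoute parseDefaultRoute(const std::string& s) {
    return s == "ok" ? DefaultRoute::OK : DefaultRoute::NA;
}

// Simplified model of posting each handler directly to the strand: handlers
// are appended to the strand's own FIFO queue in post order. Worker threads
// may still be scheduled in any order (pickupOrder), but whichever worker
// runs next always executes the oldest queued handler.
DefaultRoute finalStateViaStrandPost(const std::vector<std::string>& notifications,
                                     const std::vector<std::size_t>& pickupOrder) {
    DefaultRoute state = DefaultRoute::Wait;
    std::deque<std::function<void()>> strandQueue;
    for (const auto& n : notifications) {
        const DefaultRoute route = parseDefaultRoute(n);
        strandQueue.push_back([&state, route] { state = route; });  // ~ strand.post(handler)
    }
    for (std::size_t worker : pickupOrder) {
        (void)worker;  // the worker's identity does not affect which handler runs
        if (strandQueue.empty()) {
            break;
        }
        std::function<void()> handler = std::move(strandQueue.front());
        strandQueue.pop_front();
        handler();
    }
    return state;
}
```

Here both pickup orders, `{0, 1}` and `{1, 0}`, end in `DefaultRoute::OK`: worker scheduling still varies, but the strand's queue decides which handler runs next.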

How did you verify/test it?

  • Without this PR, the UT fails:
lolv@f8c780888096:/sonic/src/repo/sonic-linkmgrd$ ./linkmgrd-test --gtest_filter="*DefaultRouteStateRaceCondition*"
Note: Google Test filter = *DefaultRouteStateRaceCondition*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LinkManagerStateMachineTest
[ RUN      ] LinkManagerStateMachineTest.DefaultRouteStateRaceCondition
test/LinkManagerStateMachineTest.cpp:1567: Failure
Expected equality of these values:
  mFakeMuxPort.getDefaultRoute()
    Which is: 4-byte object <01-00 00-00>
  link_manager::LinkManagerStateMachineBase::DefaultRoute::OK
    Which is: 4-byte object <02-00 00-00>
[  FAILED  ] LinkManagerStateMachineTest.DefaultRouteStateRaceCondition (11327 ms)
[----------] 1 test from LinkManagerStateMachineTest (11327 ms total)
 
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11328 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] LinkManagerStateMachineTest.DefaultRouteStateRaceCondition
 
1 FAILED TEST
  • With this PR, the UT passes:
lolv@f8c780888096:/sonic/src/repo/sonic-linkmgrd$ ./linkmgrd-test --gtest_filter="*DefaultRouteStateRaceCondition*"
Note: Google Test filter = *DefaultRouteStateRaceCondition*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LinkManagerStateMachineTest
[ RUN      ] LinkManagerStateMachineTest.DefaultRouteStateRaceCondition
[       OK ] LinkManagerStateMachineTest.DefaultRouteStateRaceCondition (11496 ms)
[----------] 1 test from LinkManagerStateMachineTest (11496 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (11496 ms total)
[  PASSED  ] 1 test.
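The failure message above compares `mFakeMuxPort.getDefaultRoute()` against `DefaultRoute::OK`, and googletest, having no printer for the enum, dumps each value as raw bytes. The sketch below imitates that byte dump to show how to read it. It assumes a little-endian host, a 4-byte underlying type, and a hypothetical enum layout where NA == 1 and OK == 2 (only OK == 2 is confirmed by the output), so `<01-00 00-00>` decodes to 1, presumably the stale NA state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical mirror of the state machine's DefaultRoute enum; only OK == 2
// is confirmed by the UT output, the other values are assumptions.
enum class DefaultRoute { Wait = 0, NA = 1, OK = 2 };

// Imitate googletest's fallback byte dump for a value it cannot print:
// hex bytes in memory order, '-' inside each byte pair, ' ' between pairs.
std::string gtestStyleBytes(DefaultRoute value) {
    unsigned char bytes[sizeof(value)];
    std::memcpy(bytes, &value, sizeof(value));
    std::string out;
    for (std::size_t i = 0; i < sizeof(value); ++i) {
        if (i != 0) {
            out += (i % 2 == 0) ? ' ' : '-';
        }
        char hex[3];
        std::snprintf(hex, sizeof(hex), "%02X", static_cast<unsigned>(bytes[i]));
        out += hex;
    }
    return out;
}
```

On a typical x86-64 or AArch64 build, `gtestStyleBytes(DefaultRoute::NA)` gives `01-00 00-00` and `gtestStyleBytes(DefaultRoute::OK)` gives `02-00 00-00`, matching the actual and expected values in the failure.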

Any platform specific information?

Documentation

Signed-off-by: Longxiang Lyu <[email protected]>
yxieca merged commit d0124f5 into sonic-net:master on Jun 21, 2024
9 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Jun 21, 2024
mssonicbld (Collaborator) commented:
Cherry-pick PR to 202311: #255

mssonicbld pushed a commit that referenced this pull request Jun 21, 2024
lolyu added a commit to lolyu/sonic-linkmgrd that referenced this pull request Jun 24, 2024
lolyu mentioned this pull request on Jun 26, 2024
mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Aug 6, 2024
mssonicbld (Collaborator) commented:
Cherry-pick PR to 202305: #265

mssonicbld pushed a commit that referenced this pull request Aug 6, 2024
StormLiangMS commented:
@bingwang-ms you may want to cherry-pick this one?

mssonicbld pushed a commit to mssonicbld/sonic-linkmgrd that referenced this pull request Aug 21, 2024
bingwang-ms replied:
Done. Thanks for the reminder.

mssonicbld (Collaborator) commented:
Cherry-pick PR to 202405: #267

mssonicbld pushed a commit that referenced this pull request Aug 21, 2024
lolyu added a commit that referenced this pull request Aug 23, 2024
Approach
What is the motivation for this PR?
Fix the UT failure introduced by #254.
The failure occurs because the test waits only 10 ms for the two default route handlers to finish, which is not enough on some build image agents with limited CPU resources.

Signed-off-by: Longxiang Lyu [email protected]

Work item tracking
Microsoft ADO (number only): 28471183
How did you do it?
Increase the wait time to 8 seconds.
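This commit simply raises the fixed wait. A common alternative in tests, and not what this commit does, is to poll for the expected state with an upper bound, so fast machines don't pay the full wait while slow agents still get up to the limit. A minimal sketch using only the C++ standard library; `waitUntil` is a hypothetical helper, not part of linkmgrd.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical test helper (not from linkmgrd): poll `done` every `step`
// until it returns true or `timeout` elapses. Returns whether the condition
// was met.
bool waitUntil(const std::function<bool()>& done,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds step = std::chrono::milliseconds(1)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return done();  // one last check once the deadline has passed
        }
        std::this_thread::sleep_for(step);
    }
    return true;
}
```

A test could then wait with something like `waitUntil([&] { return mFakeMuxPort.getDefaultRoute() == link_manager::LinkManagerStateMachineBase::DefaultRoute::OK; }, std::chrono::seconds(8))`, returning as soon as the handlers have run rather than always sleeping for the full 8 seconds.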

How did you verify/test it?
UT passed.

Any platform specific information?
Documentation
5 participants