Smart Switch reboot high level design #1699

Open: vvolam wants to merge 13 commits into base: master

@vvolam vvolam commented May 16, 2024

This PR is for smart switch reboot high-level design

| Repo | Pull Request | Status |
| --- | --- | --- |
| sonic-gnmi | sonic-net/sonic-gnmi#286 | Open |
| sonic-host-services | sonic-net/sonic-host-services#164 | Open |
| sonic-platform-common | sonic-net/sonic-platform-common#501 | Merged |
| sonic-platform-common | sonic-net/sonic-platform-common#504 | Open |
| sonic-utilities | sonic-net/sonic-utilities#3566 | Draft |
| sonic-platform-daemons | sonic-net/sonic-platform-daemons#546 | Draft |

@vvolam vvolam marked this pull request as ready for review May 16, 2024 23:15
@isabelmsft isabelmsft self-requested a review May 20, 2024 23:31
ganglyu previously approved these changes Jun 17, 2024
isabelmsft previously approved these changes Jun 18, 2024
@vvolam vvolam dismissed stale reviews from isabelmsft and ganglyu via 1934915 June 26, 2024 19:00
@oleksandrivantsiv oleksandrivantsiv left a comment

As commented

@vvolam vvolam force-pushed the reboot-hld branch 2 times, most recently from 1c9a020 to 7d67e25 Compare July 30, 2024 01:16
@KrisNey-MSFT
Discussed in DASH Community call 9/18/2024
If the DPU is unresponsive and we are trying to recover it, is there a way to hard power-cycle a DPU without power-cycling the whole switch?
Via PCIe lanes, CPLD, or some other mechanism?
Can we force-shut or force-reboot the card (without forcing the entire switch), and will that be standardized or supplier-specific?
@prgeor

prgeor commented Sep 18, 2024

@vvolam

> Discussed in DASH Community call 9/18/2024
> If the DPU is unresponsive and we are trying to recover it, is there a way to hard power-cycle a DPU without power-cycling the whole switch?
> Via PCIe lanes, CPLD, or some other mechanism?
> Can we force-shut or force-reboot the card (without forcing the entire switch), and will that be standardized or supplier-specific?
> @prgeor

@vvolam FYI

prgeor commented Sep 18, 2024

@vvolam please add all the code PRs to this HLD PR description

| Planned cold reboot of DPU | - | Graceful reboot |
| Planned power-cycle of Smart Switch | Graceful reboot | Graceful reboot |
| Planned power-cycle of DPU | - | Graceful reboot |
| Unplanned DPU power failure | - | Ungraceful reboot |
Contributor

@vvolam how are we planning to induce this failure in sonic-mgmt test?

Author

@prgeor I don't have implementation details yet. If this cannot be done in a sonic-mgmt test, I plan to at least cover it in unit tests.


* With the DPUs prepared for reboot, the NPU calls a platform vendor API to initiate the reboot process for the DPUs. The vendor API reboots a single DPU, so the NPU spawns multiple threads to reboot the DPUs in parallel. If any DPU is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.

* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error-handling mechanism to manage DPU reboot failures and should log every failure. If the reboot succeeds, the DPUs will be in the DPU_READY state.
Contributor

@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? How is this implemented?

Author

It is not actually an acknowledgment but the response to the platform vendor API call. I have reworded accordingly.
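
For readers following this thread, the parallel DPU reboot described in the HLD excerpt above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeDpu`, its `reboot()` and `power_cycle()` methods, `reboot_one`, and `reboot_all` are hypothetical names, not the SONiC platform API. The HLD also places the power-cycle fallback inside the vendor API itself; here it sits in a wrapper purely to make the flow visible.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("dpu_reboot")


class FakeDpu:
    """Hypothetical stand-in for a vendor DPU object (not a real SONiC class)."""

    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive
        self.state = None  # becomes "DPU_READY" after a successful reboot

    def reboot(self):
        # Hypothetical vendor call: True on success, False if the DPU is stuck.
        if not self.responsive:
            return False
        self.state = "DPU_READY"
        return True

    def power_cycle(self):
        # Hypothetical recovery path for a stuck or unresponsive DPU.
        self.responsive = True
        self.state = "DPU_READY"
        return True


def reboot_one(dpu):
    """Reboot one DPU, falling back to a power cycle; return (name, ok)."""
    try:
        if dpu.reboot():
            return dpu.name, True
        logger.error("%s: reboot failed, attempting power cycle", dpu.name)
        return dpu.name, dpu.power_cycle()
    except Exception:
        # Log every failure, as the HLD text asks, and report it to the caller.
        logger.exception("%s: reboot raised an exception", dpu.name)
        return dpu.name, False


def reboot_all(dpus):
    """Reboot all DPUs in parallel, one worker thread per DPU."""
    if not dpus:
        return {}
    with ThreadPoolExecutor(max_workers=len(dpus)) as pool:
        return dict(pool.map(reboot_one, dpus))
```

In this sketch the NPU side would call `reboot_all(dpus)`, inspect the returned `{name: ok}` mapping, and only then proceed with its own reboot. The per-DPU return value plays the role of the "response to the platform vendor API" mentioned in the reply above.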
