Smart Switch reboot high level design #1699

vvolam · 2024-05-16T19:45:34Z

This PR adds the high-level design (HLD) for Smart Switch reboot.

| Repo | Pull Request | Status |
|------|--------------|--------|
| sonic-gnmi | sonic-net/sonic-gnmi#286 | Open |
| sonic-host-services | sonic-net/sonic-host-services#164 | Open |
| sonic-platform-common | sonic-net/sonic-platform-common#501 | Merged |
| sonic-platform-common | sonic-net/sonic-platform-common#504 | Open |
| sonic-utilities | sonic-net/sonic-utilities#3566 | Draft |
| sonic-platform-daemons | sonic-net/sonic-platform-daemons#546 | Draft |

doc/smart-switch/reboot/reboot-hld.md

oleksandrivantsiv

As commented

doc/smart-switch/reboot/reboot-hld.md

KrisNey-MSFT · 2024-09-18T16:22:09Z

Discussed in DASH Community call 9/18/2024
If the DPU is unresponsive and we are trying to recover it, is there a way to hard power cycle a DPU w/o having to power cycle the switch?
Via PCIe lanes, CPLD, or other?
Force-shut or force-reboot the card (w/o forcing the entire switch), and will it be standardized or supplier-specific?
@prgeor

prgeor · 2024-09-18T22:41:12Z

@vvolam

> Discussed in DASH Community call 9/18/2024
> If the DPU is unresponsive and we are trying to recover it, is there a way to hard power cycle a DPU w/o having to power cycle the switch?
> Via PCIe lanes, CPLD, or other?
> Force-shut or force-reboot the card (w/o forcing the entire switch), and will it be standardized or supplier-specific?
> @prgeor

@vvolam FYI

prgeor · 2024-09-18T22:41:36Z

@vvolam please add all the code PRs to this HLD PR description

This is initial draft

sonic-net/sonic-platform-common#454

doc/smart-switch/reboot/reboot-hld.md

prgeor · 2024-10-06T15:01:19Z

doc/smart-switch/reboot/reboot-hld.md

+| Planned cold reboot of DPU                | -                   | Graceful reboot     |
+| Planned power-cycle of Smart Switch       | Graceful reboot     | Graceful reboot     |
+| Planned power-cycle of DPU                | -                   | Graceful reboot     |
+| Unplanned DPU power failure               | -                   | Ungraceful reboot   |


@vvolam how are we planning to induce this failure in sonic-mgmt test?

vvolam · 2024-10-09

@prgeor I don't have the implementation details yet. If this cannot be done in a sonic-mgmt test, I plan to at least cover it in unit tests.
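As a rough illustration of the unit-test route mentioned above, an unplanned DPU power failure could be simulated with `unittest.mock` rather than induced on hardware. Everything named here (`handle_dpu_state_change`, `is_powered`, `planned_reboots`) is hypothetical and not taken from the HLD or any SONiC repository; only the "Graceful reboot" / "Ungraceful reboot" labels come from the table in the diff.

```python
from unittest import mock


def handle_dpu_state_change(dpu, planned_reboots):
    """Hypothetical handler: classify a DPU going down as graceful or ungraceful."""
    if dpu.is_powered():
        return "No reboot"
    if dpu.name in planned_reboots:
        return "Graceful reboot"
    return "Ungraceful reboot"


def test_unplanned_dpu_power_failure():
    # A mock stands in for the DPU module; is_powered() is an assumed method.
    dpu = mock.Mock()
    dpu.name = "DPU0"
    dpu.is_powered.return_value = False  # simulate sudden power loss
    assert handle_dpu_state_change(dpu, planned_reboots=set()) == "Ungraceful reboot"


def test_planned_dpu_power_cycle():
    dpu = mock.Mock()
    dpu.name = "DPU0"
    dpu.is_powered.return_value = False
    assert handle_dpu_state_change(dpu, planned_reboots={"DPU0"}) == "Graceful reboot"
```

A test like this only checks the classification logic on the NPU side; it says nothing about how real hardware behaves when power is lost.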

doc/smart-switch/reboot/reboot-hld.md

prgeor · 2024-10-07T02:17:56Z

doc/smart-switch/reboot/reboot-hld.md

+
+* With the DPUs prepared for reboot, the NPU triggers a platform vendor API to initiate the reboot process for the DPUs. The vendor API reboots a single DPU, but the NPU spawns multiple threads to reboot DPUs in parallel. If any of the DPUs is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.
+
+* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures. Additionally, all failures should be logged. If the reboot succeeds, the DPUs will be in the DPU_READY state.


@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? How is this implemented?

vvolam · 2024-10-09

It is not actually an acknowledgment but the response to the platform vendor API call. I have reworded it accordingly.
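The flow discussed in this thread (one vendor call per DPU, run in parallel threads, with a power-cycle fallback and failure logging) could look roughly like the sketch below. `FakeDpu`, `reboot()`, and `power_cycle()` are hypothetical stand-ins, not the actual sonic-platform-common API, and the return values model the "response to the platform vendor API" rather than any acknowledgment message from the DPU.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("smartswitch_reboot")


class FakeDpu:
    """Hypothetical stand-in for a vendor DPU module object."""

    def __init__(self, name, responsive=True, recoverable=True):
        self.name = name
        self.responsive = responsive
        self.recoverable = recoverable

    def reboot(self):
        # Assumed vendor call: True on success, False if the DPU is stuck.
        return self.responsive

    def power_cycle(self):
        # Assumed fallback: hard power cycle of this DPU only.
        return self.recoverable


def reboot_one_dpu(dpu):
    """Reboot a single DPU; fall back to a power cycle if it is unresponsive."""
    try:
        if dpu.reboot():
            return True
        logger.warning("%s did not respond to reboot; trying power cycle", dpu.name)
        if dpu.power_cycle():
            return True
        logger.error("%s could not be recovered by power cycle", dpu.name)
    except Exception:
        logger.exception("reboot of %s raised an error", dpu.name)
    return False


def reboot_all_dpus(dpus):
    """Reboot all DPUs in parallel threads and return {name: success}."""
    if not dpus:
        return {}
    with ThreadPoolExecutor(max_workers=len(dpus)) as pool:
        results = list(pool.map(reboot_one_dpu, dpus))
    return {dpu.name: ok for dpu, ok in zip(dpus, results)}
```

In this sketch the NPU would call `reboot_all_dpus()` first and proceed to its own reboot only after the per-DPU results have been collected and any failures logged.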

doc/smart-switch/reboot/reboot-hld.md

vvolam marked this pull request as ready for review May 16, 2024 23:15

vvolam requested review from oleksandrivantsiv, rameshraghupathy, prgeor, r12f and dgsudharsan May 16, 2024 23:17

isabelmsft self-requested a review May 20, 2024 23:31

isabelmsft reviewed May 21, 2024

vvolam force-pushed the reboot-hld branch from 47fa43b to 829b510 on May 29, 2024 22:54

vvolam requested review from qiluo-msft and ganglyu May 30, 2024 15:15

rameshraghupathy reviewed May 30, 2024

vvolam force-pushed the reboot-hld branch from c45142a to 075d745 on June 10, 2024 23:48

isabelmsft reviewed Jun 11, 2024

oleksandrivantsiv mentioned this pull request Jun 14, 2024

Smartswitch Platform Test Plan Document sonic-net/sonic-mgmt#12701

Merged

5 tasks

ganglyu previously approved these changes Jun 17, 2024

isabelmsft previously approved these changes Jun 18, 2024

oleksandrivantsiv reviewed Jun 24, 2024

oleksandrivantsiv approved these changes Jun 24, 2024

oleksandrivantsiv mentioned this pull request Jun 25, 2024

PMON Test Plan sonic-net/sonic-mgmt#13200

Open

oleksandrivantsiv reviewed Jun 26, 2024

vvolam dismissed stale reviews from isabelmsft and ganglyu via 1934915 June 26, 2024 19:00

oleksandrivantsiv suggested changes Jun 28, 2024

oleksandrivantsiv reviewed Jul 2, 2024

vvolam force-pushed the reboot-hld branch 2 times, most recently from 1c9a020 to 7d67e25 on July 30, 2024 01:16

rameshraghupathy reviewed Aug 5, 2024

rameshraghupathy reviewed Aug 6, 2024

vvolam force-pushed the reboot-hld branch from 7d67e25 to 8b70aab on August 6, 2024 19:26

vvolam mentioned this pull request Sep 11, 2024

Added schema for health_info, reboot_cause on chassisStateDB and added the link to pmon-test-plan #1709

Open

vvolam mentioned this pull request Sep 24, 2024

Added new Platform APIs and modified APIs for supporting reboot on a SmartSwitch sonic-net/sonic-platform-common#501

Merged

vvolam added 11 commits September 24, 2024 21:53

Smart Switch reboot high level design

46d8e0a

This is initial draft

Update HLD with modified APIs and images

2af6880

Minor update to test plan

7539f7a

Minor changes based on discussion with the community

24c47fb

Address review comments

94dec18

Minor correction to pci rescan information

c050f48

Update reboot mechanism of the DPU and pcie daemon changes

a0c9412

Minor changes

6b165b2

Minor changes

a37115c

Made a minor change to dup_id based on get_dpu_id() update in sonic-net/sonic-platform-common#454

442e8a7

Add some enhancements

f7ca496

vvolam force-pushed the reboot-hld branch from 26f3f4e to f7ca496 on September 24, 2024 22:44

Minor change to new APIs

605c3a5

vvolam mentioned this pull request Oct 5, 2024

Enhance PCIe device check to skip the warning log, if device is in detaching mode sonic-net/sonic-platform-daemons#546

Draft

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 6, 2024

prgeor reviewed Oct 7, 2024

prgeor reviewed Oct 7, 2024

Address review comments

04240e7

prgeor Oct 6, 2024

vvolam Oct 9, 2024

prgeor Oct 7, 2024

vvolam Oct 9, 2024

