Smart Switch reboot high level design #1699
base: master
As commented
1c9a020 to 7d67e25
Discussed in DASH Community call 9/18/2024
@vvolam FYI
@vvolam please add all the code PRs to this HLD PR description
This is an initial draft
| Planned cold reboot of DPU | - | Graceful reboot | |
| Planned power-cycle of Smart Switch | Graceful reboot | Graceful reboot | |
| Planned power-cycle of DPU | - | Graceful reboot | |
| Unplanned DPU power failure | - | Ungraceful reboot | |
@vvolam how are we planning to induce this failure in sonic-mgmt test?
@prgeor I don't have implementation details yet. If this cannot be done in a sonic-mgmt test, I plan to at least cover it in unit tests.
* With the DPUs prepared for reboot, the NPU triggers a platform vendor API to initiate the reboot process for the DPUs. The vendor API reboots a single DPU, so the NPU spawns multiple threads to reboot the DPUs in parallel. If any DPU is stuck or unresponsive, the DPU reboot platform API should attempt a cold boot or power cycle to recover it.
* DPUs will send an acknowledgment to the NPU and then undergo a reboot. After receiving the acknowledgment from the DPUs, the NPU will proceed to reboot itself to complete the overall reboot procedure. The vendor-specific reboot API should include an error handling mechanism to manage DPU reboot failures, and all failures should be logged. If the reboot succeeds, the DPUs will be in the DPU_READY state.
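
The parallel reboot described in the two bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design's implementation: `reboot_fn` and `power_cycle_fn` are hypothetical callables standing in for the vendor platform API, and the function names are invented here. The HLD places the power-cycle fallback inside the platform API itself; the sketch puts it in a wrapper only to keep the example self-contained.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("smartswitch_reboot")


def reboot_one_dpu(dpu, reboot_fn, power_cycle_fn):
    """Reboot one DPU; fall back to a power cycle if the reboot fails.

    reboot_fn and power_cycle_fn are hypothetical stand-ins for the
    vendor platform API. Each takes a DPU name and returns True on success.
    """
    try:
        if reboot_fn(dpu):
            return True
        logger.error("%s: reboot reported failure, trying power cycle", dpu)
    except Exception:
        logger.exception("%s: reboot raised, trying power cycle", dpu)
    try:
        return bool(power_cycle_fn(dpu))
    except Exception:
        logger.exception("%s: power cycle failed", dpu)
        return False


def reboot_all_dpus(dpus, reboot_fn, power_cycle_fn):
    """Reboot all DPUs in parallel, one thread per DPU.

    Returns a dict mapping each DPU name to True (recovered) or False.
    """
    dpus = list(dpus)
    if not dpus:
        return {}
    with ThreadPoolExecutor(max_workers=len(dpus)) as pool:
        results = pool.map(
            lambda d: reboot_one_dpu(d, reboot_fn, power_cycle_fn), dpus
        )
        return dict(zip(dpus, results))
```

In this sketch the NPU would inspect the returned dictionary, log any DPU that maps to `False`, and only then proceed with its own reboot.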
@vvolam can you elaborate on this ack: "DPUs will send an acknowledgment to the NPU and then undergo a reboot"? How is this implemented?
It is not actually an acknowledgment but a response to the platform vendor API. I have reworded accordingly.
This PR is for the Smart Switch reboot high-level design.