Failed to remove dummy SAI objects after warm-reboot causing orchagent crash #1429
Comments
Hey, what do you mean by dummy objects? syncd creates objects based on real object IDs that can be discovered during warm boot to new firmware; those objects are then created in ASIC_DB without any attributes. Since there is an issue here, it could be due to a bug in either BRCM or syncd, but as I have had a lot of similar issues with BRCM, I lean toward this being a bug on their side: probably the same objects get different object IDs after warm boot. From your description it seems like you are doing:
and that the second warm boot fails. For better analysis, can you attach the syslog from the entire process, from the first cold boot to the last warm boot? If it's confidential, please send it to me directly by email or Teams.
Regarding dummy objects, I was referring to the ones created in
The problem seems to be that after the warm-reboot from 201811 to 202012, apart from the 5 real objects that are the same as what existed on 201811 prior to the upgrade, there are 5 additional objects that are the dummy ones with no attributes. I shared the saidump output earlier (after step 2); asic_db output showed the same:
I assume the 5 objects with OIDs ending in c60, c61, c62, c63 and c64 are the real objects; the 5 with OIDs ending in bd3, bd4, bd5, bd6 and bd7 are the dummy ones that are supposed to be transient, but remain there after the warm-reboot is complete. Yes, you got the steps all correct. I have records for all 3 steps (one cold boot followed by two warm boots). I'll get you the logs separately shortly. Thanks!
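The real/dummy split described above can be checked mechanically. This is a minimal sketch, assuming the buffer profile entries have already been loaded into a plain dict of OID to attribute map, and that a dummy object shows up with an empty attribute map. The function name, the truncated OID strings and the attribute value are hypothetical placeholders, not actual ASIC_DB output.

```python
def split_real_and_dummy(objects):
    """Partition {oid: attrs} into (real, dummy) by whether attrs is non-empty."""
    real, dummy = {}, {}
    for oid, attrs in objects.items():
        # Assumption: a dummy object is one that carries no attributes at all.
        target = real if attrs else dummy
        target[oid] = attrs
    return real, dummy


# Hypothetical stand-in data: only the OID suffixes come from the report.
REAL_SUFFIXES = ("c60", "c61", "c62", "c63", "c64")
DUMMY_SUFFIXES = ("bd3", "bd4", "bd5", "bd6", "bd7")

buffer_profiles = {
    f"oid:...{s}": {"SAI_BUFFER_PROFILE_ATTR_POOL_ID": "<pool oid>"}
    for s in REAL_SUFFIXES
}
buffer_profiles.update({f"oid:...{s}": {} for s in DUMMY_SUFFIXES})

real, dummy = split_real_and_dummy(buffer_profiles)
print("real :", sorted(real))
print("dummy:", sorted(dummy))
```

If the database marks attribute-less objects with a placeholder field instead of an empty map, the emptiness test would need to be adjusted accordingly.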
Summary
We have a bunch of switches with an old 201811-based image to be upgraded to our next available image, based off 202012. The warm-reboot upgrade went through, but the next warm-reboot (same 202012-based image, no upgrade) ends with an orchagent crash. The cause is syncd trying to remove some dummy SAI objects created by the previous warm-reboot upgrade but failing with a SAI_STATUS_INVALID_PARAMETER error.
Details
1. With the 201811 image (before the warm-reboot upgrade), we have some QoS configs that create 5 SAI_OBJECT_TYPE_BUFFER_PROFILE entries.
2. After warm-reboot upgrade to the 202012-based image, the 5 objects are still present (with different OIDs), and 5 dummy objects are also created. This might be a result of the BRCM SAI upgrade.
3. With one additional warm-reboot to the same 202012-based image, syncd detects a discrepancy of SAI_OBJECT_TYPE_BUFFER_PROFILE between the current view (10 objects) and the temp view (5 objects). Then the attempt to remove the 5 dummy objects via executeOperationsOnAsic leads to SAI errors; eventually syncd stopped and orchagent exited.
In this problem state, the ASIC_DB has 10 actual objects (including the 5 dummy ones) and 5 temp objects.
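The failure in step 3 can be illustrated with a simplified model. This is a sketch only, not syncd's actual comparison logic: it assumes objects can be matched between views by a stable key, and it uses a hypothetical `FakeSai` class to stand in for a SAI implementation that rejects the removal. The names `plan_removals` and `apply_removals` are likewise made up for illustration.

```python
SAI_STATUS_SUCCESS = "SAI_STATUS_SUCCESS"
SAI_STATUS_INVALID_PARAMETER = "SAI_STATUS_INVALID_PARAMETER"


def plan_removals(current_view, temp_view):
    """Objects in the current view with no counterpart in the temp view."""
    return sorted(set(current_view) - set(temp_view))


class FakeSai:
    """Hypothetical SAI stand-in that rejects removal of selected objects."""

    def __init__(self, rejects):
        self.rejects = set(rejects)

    def remove(self, oid):
        if oid in self.rejects:
            return SAI_STATUS_INVALID_PARAMETER
        return SAI_STATUS_SUCCESS


def apply_removals(sai, oids):
    """Remove each object, treating any non-success status as fatal."""
    for oid in oids:
        status = sai.remove(oid)
        if status != SAI_STATUS_SUCCESS:
            raise RuntimeError(f"remove({oid}) failed with {status}")


# 10 objects in the current view (5 real + 5 dummy), 5 in the temp view.
current_view = {f"real-{i}" for i in range(5)} | {f"dummy-{i}" for i in range(5)}
temp_view = {f"real-{i}" for i in range(5)}

to_remove = plan_removals(current_view, temp_view)
print("planned removals:", to_remove)

try:
    apply_removals(FakeSai(rejects=to_remove), to_remove)
except RuntimeError as err:
    print("fatal:", err)
```

In this model the first rejected removal aborts the whole apply step, which mirrors the reported sequence of a SAI error followed by syncd stopping and orchagent exiting.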
Next Steps
Can you please help us understand the potential issue here?
1. Why are the extra SAI_OBJECT_TYPE_BUFFER_PROFILE objects created? This is different from PORT_SERDES objects, which were not there before the upgrade but were introduced as default objects after the upgrade. Are these extra objects supposed to stay even after the warm-reboot process is complete?
2. Is syncd supposed to call the remove() API to remove those extra/dummy objects? If so, is it possibly a BRCM SAI bug that the remove() operation fails?
Thanks in advance for assistance.