Failed to remove dummy SAI objects after warm-reboot causing orchagent crash #1429

Open
Stephenxf opened this issue Oct 8, 2024 · 2 comments


@Stephenxf

Summary

We have a number of switches running an old 201811-based image that need to be upgraded to our next available image, which is based on 202012. The warm-reboot upgrade went through, but the next warm-reboot (same 202012-based image, no upgrade) ended in an orchagent crash. The cause is that syncd tries to remove some dummy SAI objects created by the previous warm-reboot upgrade, and the removal fails with a SAI_STATUS_INVALID_PARAMETER error.

Details

1. With the 201811 image (before the warm-reboot upgrade), we have some QoS configs that create 5 SAI_OBJECT_TYPE_BUFFER_PROFILE entries.

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b10
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b11
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b12
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b13
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000b14
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000b0f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

2. After the warm-reboot upgrade to the 202012-based image, the 5 objects are still present (with different OIDs), and 5 dummy objects have also been created. This might be a result of the BRCM SAI upgrade.

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd3

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd4

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd5

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd6

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd7

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c60
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5d
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_STATIC_TH     : 15982720
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_STATIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c61
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5e
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1518
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c62
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c63
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 138112
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c64
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 1248
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 0
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
    SAI_BUFFER_PROFILE_ATTR_XOFF_TH              : 53664
    SAI_BUFFER_PROFILE_ATTR_XON_OFFSET_TH        : 2288
    SAI_BUFFER_PROFILE_ATTR_XON_TH               : 2288
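
As a rough aid for spotting the attribute-less entries in a dump like the one above, here is a minimal Python sketch. It assumes the saidump text follows exactly the layout shown here (an unindented header line per object, then indented `NAME : value` lines); the names `parse_saidump` and `attribute_less` are hypothetical helpers, not part of any SONiC tool.

```python
import re

# Header lines look like: "SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd3"
HEADER = re.compile(r"^(SAI_OBJECT_TYPE_\w+)\s+(oid:0x[0-9a-fA-F]+)\s*$")


def parse_saidump(text):
    """Parse saidump-style text into {oid: {attribute: value}}."""
    objects = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        header = HEADER.match(line)
        if header:
            current = header.group(2)
            objects[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so "oid:0x..." values stay intact.
            name, value = line.split(":", 1)
            objects[current][name.strip()] = value.strip()
    return objects


def attribute_less(objects):
    """Return the OIDs of objects that carry no attributes at all."""
    return sorted(oid for oid, attrs in objects.items() if not attrs)


# Two entries copied from the dump above: one without and one with attributes.
sample = """\
SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000bd3

SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x19000000000c62
    SAI_BUFFER_PROFILE_ATTR_POOL_ID              : oid:0x18000000000c5f
    SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE : 0
    SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH    : 3
    SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE       : SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
"""

print(attribute_less(parse_saidump(sample)))
```

Run against the full dump, this should list the five bd3 to bd7 entries, assuming the layout assumption holds.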

3. After one additional warm-reboot to the same 202012-based image, syncd detects a discrepancy in the number of SAI_OBJECT_TYPE_BUFFER_PROFILE objects between the current view (10 objects) and the temp view (5 objects).

WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_PROFILE on current view 10 is different than on temporary view: 5

The attempt to remove the 5 dummy objects via executeOperationsOnAsic then leads to SAI errors; eventually syncd stops and orchagent exits.

Oct  8 09:18:02.721171 fab01 ERR syncd#syncd: [none] SAI_API_BUFFER:brcm_sai_remove_buffer_profile:2423 Getting buffer pool data failed with error -5.
Oct  8 09:18:02.721255 fab01 ERR syncd#syncd: :- asic_handle_generic: remove SAI_OBJECT_TYPE_BUFFER_PROFILE RID: oid:0x1900000001 VID oid:0x19000000000bd3 failed: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.721255 fab01 ERR syncd#syncd: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.721419 fab01 NOTICE syncd#syncd: :- executeOperationsOnAsic: asic apply took 0.014296 sec
Oct  8 09:18:02.721445 fab01 ERR syncd#syncd: :- executeOperationsOnAsic: Error while executing asic operations, ASIC is in inconsistent state: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.765644 fab01 NOTICE syncd#syncd: :- applyView: apply took 0.687701 sec
Oct  8 09:18:02.767236 fab01 ERR syncd#syncd: :- run: Runtime error: :- asic_process_event: failed to execute api: remove, key: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3, status: SAI_STATUS_INVALID_PARAMETER
Oct  8 09:18:02.767335 fab01 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Oct  8 09:18:02.769449 fab01 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfully
Oct  8 09:18:02.769614 fab01 ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1
Oct  8 09:18:02.769962 fab01 ERR swss#orchagent: :- on_switch_shutdown_request: Syncd stopped
Oct  8 09:18:02.770225 fab01 INFO swss#/supervisord: orchagent free(): corrupted unsorted chunks
Oct  8 09:18:02.770952 fab01 INFO kernel: [   71.180447] traps: orchagent[2815] general protection ip:7f3abd15ae87 sp:7f3ab7ffe710 error:0 in libc-2.28.so[7f3abd143000+147000]
Oct  8 09:18:03.061421 fab01 INFO swss#supervisord 2024-10-08 09:18:03,060 INFO exited: orchagent (terminated by SIGSEGV (core dumped); not expected)
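
For triage across many switches, the failing remove operations can be pulled out of a syslog like the one above with a small script. This is only a sketch: the regular expression is derived from the single `asic_handle_generic` line shown here and assumes other syncd versions log in the same format; `failed_removes` is a hypothetical helper name.

```python
import re

# Pattern modelled on the "asic_handle_generic: remove ... failed: ..." line above.
FAILED_REMOVE = re.compile(
    r"asic_handle_generic: remove (?P<type>SAI_OBJECT_TYPE_\w+) "
    r"RID: (?P<rid>oid:0x[0-9a-fA-F]+) VID (?P<vid>oid:0x[0-9a-fA-F]+) "
    r"failed: (?P<status>SAI_STATUS_\w+)"
)


def failed_removes(log_text):
    """Return one dict (type, rid, vid, status) per failed remove in the log."""
    return [match.groupdict() for match in FAILED_REMOVE.finditer(log_text)]


# Line copied from the log above.
log_line = (
    "Oct  8 09:18:02.721255 fab01 ERR syncd#syncd: :- asic_handle_generic: "
    "remove SAI_OBJECT_TYPE_BUFFER_PROFILE RID: oid:0x1900000001 "
    "VID oid:0x19000000000bd3 failed: SAI_STATUS_INVALID_PARAMETER"
)

for failure in failed_removes(log_line):
    print(failure["vid"], failure["rid"], failure["status"])
```

The VID reported this way can then be matched against the ASIC_DB keys to confirm that the failing object is one of the attribute-less entries.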

In this problem state, the ASIC_DB has 10 actual objects (including the 5 dummy ones) and 5 temp objects.

127.0.0.1:6379[1]> KEYS *BUFFER_PROFILE*
 1) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2d"
 2) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2c"
 3) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c63"
 4) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3"
 5) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c61"
 6) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2e"
 7) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2b"
 8) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c64"
 9) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c60"
10) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
11) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6"
12) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
13) "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2f"
14) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5"
15) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7"
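
The 10-versus-5 split in the key listing above can be checked mechanically. The sketch below groups the keys by their view prefix; it assumes every key has the `VIEW:OBJECT_TYPE:oid:0x...` shape seen here, and `group_by_view` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_view(keys):
    """Group ASIC_DB keys of the form VIEW:OBJECT_TYPE:oid:0x... by view."""
    views = defaultdict(list)
    for key in keys:
        # Split into view, object type and the remaining "oid:0x..." part.
        view, _object_type, oid = key.split(":", 2)
        views[view].append(oid)
    return {view: sorted(oids) for view, oids in views.items()}


# Keys copied from the KEYS *BUFFER_PROFILE* output above.
keys = [
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2d",
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2c",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c63",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c61",
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2e",
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2b",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c64",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c60",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4",
    "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000e2f",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5",
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7",
]

views = group_by_view(keys)
counts = {view: len(oids) for view, oids in views.items()}
print(counts)
```

The per-view counts match the numbers in the logViewObjectCount warning (10 in the current view, 5 in the temporary view).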

Next Steps

Can you please help us understand the potential issue here?

  • In step 2, is it expected that extra SAI_OBJECT_TYPE_BUFFER_PROFILE objects are created? This is different from the PORT_SERDES objects, which did not exist before the upgrade but were introduced as default objects after it. Are these extra objects supposed to stay even after the warm-reboot process is complete?
  • In step 3, is syncd expected to call the SAI remove() API to remove those extra/dummy objects? If so, could it be a BRCM SAI bug that the remove() operation fails?

Thanks in advance for assistance.

@kcudnik
Collaborator

kcudnik commented Oct 9, 2024

Hey, what do you mean by dummy objects? syncd creates objects based on the real object IDs that it can discover during warm boot to new firmware; those objects are then created in asic-db without any attributes.

Since there is an issue here, it could be due to a bug in either brcm or syncd, but since I have had a lot of similar issues with brcm, I would lean toward this being a bug on their side. Probably the same objects get different object IDs after warm boot.

From your description, it seems like you are doing:

  • cold boot to 201811
  • warm boot to 202012
  • warm boot to 202012

and that second warm boot fails.

For better analysis, can you attach the syslog from the entire process, from the first cold boot to the last warm boot? If it's confidential, please send it to me directly by email or Teams.

@Stephenxf
Author

Regarding dummy objects, I was referring to the ones created in SaiSwitch::checkWarmBootDiscoveredRids():

          // this means that some new objects were discovered but they are
          // not present in current ASIC_VIEW, and we need to create dummy
          // entries for them

          redisSetDummyAsicStateForRealObjectId(rid);

The problem seems to be that after the warm-reboot from 201811 to 202012, apart from the 5 real objects that are the same as those that existed on 201811 prior to the upgrade, there are 5 additional dummy objects with no attributes. I shared the saidump output earlier (after step 2); the asic_db output shows the same:

$ redis-cli
127.0.0.1:6379> SELECT 1
OK
127.0.0.1:6379[1]> KEYS *BUFFER_PROFILE*
 1) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
 2) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c64"
 3) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
 4) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c60"
 5) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd3"
 6) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c63"
 7) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd7"
 8) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c61"
 9) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd5"
10) "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd6"
127.0.0.1:6379[1]>
127.0.0.1:6379[1]> HGETALL "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62"
1) "SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE"
2) "SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC"
3) "SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH"
4) "3"
5) "SAI_BUFFER_PROFILE_ATTR_POOL_ID"
6) "oid:0x18000000000c5f"
7) "SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE"
8) "0"
127.0.0.1:6379[1]> HGETALL "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4"
1) "NULL"
2) "NULL"
127.0.0.1:6379[1]>

I assume the 5 objects with OIDs ending in c60, c61, c62, c63 and c64 are the real objects; the 5 with OIDs ending in bd3, bd4, bd5, bd6 and bd7 are the dummy ones, which are supposed to be transient but remain after the warm-reboot is complete.
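
The real-versus-dummy split can be expressed as a small check over the hash contents. This is a sketch under one assumption drawn only from the HGETALL output above: a dummy entry is a hash whose single field/value pair is `NULL`/`NULL`. The helper names are hypothetical, and the hashes are given as plain dicts rather than read from a live Redis instance.

```python
def is_dummy_entry(fields):
    """True if the hash holds only the NULL/NULL pair seen for dummy entries."""
    return fields == {"NULL": "NULL"}


def split_real_and_dummy(hashes):
    """Split {key: fields} into sorted (real_keys, dummy_keys) lists."""
    real = sorted(key for key, fields in hashes.items() if not is_dummy_entry(fields))
    dummy = sorted(key for key, fields in hashes.items() if is_dummy_entry(fields))
    return real, dummy


# Contents copied from the two HGETALL results above.
hashes = {
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000c62": {
        "SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE": "SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC",
        "SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH": "3",
        "SAI_BUFFER_PROFILE_ATTR_POOL_ID": "oid:0x18000000000c5f",
        "SAI_BUFFER_PROFILE_ATTR_RESERVED_BUFFER_SIZE": "0",
    },
    "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x19000000000bd4": {
        "NULL": "NULL",
    },
}

real, dummy = split_real_and_dummy(hashes)
print("real:", real)
print("dummy:", dummy)
```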

Yes, you got the steps all correct. I have records for all 3 steps (one cold boot followed by two warm boots). I'll get you the logs separately shortly. Thanks!
