[dualtor] CRM test fails on test_crm_nexthop_group when tunnel route created for PortChannel neighbor #21243

Open
stepanblyschak opened this issue Dec 20, 2024 · 0 comments

Description

CRM test fails on test_crm_nexthop_group test case due to orchagent crash:

2024 Dec 19 18:15:25.382634 sonic ERR swss#orchagent: :- removeRoutePost: Failed to remove route prefix:2.0.4.99/32
2024 Dec 19 18:15:25.382654 sonic NOTICE swss#orchagent: :- removeRoutePost: Remove Nexthop Group 2.0.0.1@PortChannel101,2.0.4.99@PortChannel101
2024 Dec 19 18:15:25.382654 sonic INFO swss#orchagent: :- isMuxNexthops: No mux nexthop found
2024 Dec 19 18:15:25.382666 sonic INFO swss#orchagent: :- removeRoutePost: Remove route 2.0.4.99/32 with next hop(s) 2.0.0.1@PortChannel101,2.0.4.99@PortChannel101
2024 Dec 19 18:15:25.382884 sonic NOTICE swss#orchagent: :- removeNextHopGroup: Delete next hop group 2.0.0.1@PortChannel101,2.0.4.10@PortChannel101
2024 Dec 19 18:15:25.383167 sonic NOTICE pmon#ycabled: Async client port = Ethernet76 exception occured because of StatusCode.UNAVAILABLE tid YCableAsyncNotificationTask
2024 Dec 19 18:15:25.387805 sonic INFO swss#orchagent: :- flush_removing_entries: ObjectBulker.flush removing_entries 2 rc=0 statuses[0]=0
2024 Dec 19 18:15:25.387805 sonic ERR swss#orchagent: :- meta_generic_validation_remove: object 0x5000000001c67 reference count is 1, can't remove
2024 Dec 19 18:15:25.387829 sonic ERR swss#orchagent: :- removeNextHopGroup: Failed to remove next hop group 5000000001c67, rv:-17
2024 Dec 19 18:15:25.387840 sonic ERR swss#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_OBJECT_IN_USE
2024 Dec 19 18:15:25.387846 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP

During investigation I also found another flow that triggers a different crash:

from swsscommon import swsscommon

def to_fvs(fvs):
    # Convert a plain dict into the FieldValuePairs type that swsscommon expects.
    return swsscommon.FieldValuePairs([(k, v) for k, v in fvs.items()])

db = swsscommon.DBConnector("APPL_DB", 0)
ntbl = swsscommon.ProducerStateTable(db, "NEIGH_TABLE")
rtbl = swsscommon.ProducerStateTable(db, "ROUTE_TABLE")

ip = "2.0.6.42"
ip2 = "2.0.0.2"

# Baseline: one neighbor with a valid MAC and a /8 route on PortChannel101.
ntbl.set(f"PortChannel101:{ip2}", to_fvs({"neigh": "00:22:00:00:00:11", "family": "IPv4"}))
rtbl.set("2.0.0.0/8", to_fvs({"protocol": "kernel", "nexthop": "0.0.0.0", "ifname": "PortChannel101"}))

# Add a second neighbor on the PortChannel. The MAC needs to be all zeros.
ntbl.set(f"PortChannel101:{ip}", to_fvs({"neigh": "00:00:00:00:00:00", "family": "IPv4"}))

# Install a route for the same address with both neighbors as next hops.
rtbl.set(ip, to_fvs({"protocol": "kernel", "nexthop": f"{ip2},{ip}", "ifname": "PortChannel101,PortChannel101"}))

# Update the second neighbor with a non-zero MAC.
ntbl.set(f"PortChannel101:{ip}", to_fvs({"neigh": "00:11:00:00:00:22", "family": "IPv4"}))

# Clean up the route and neighbors created above.
rtbl._del(ip)
ntbl._del(f"PortChannel101:{ip}")

ntbl._del(f"PortChannel101:{ip2}")
rtbl._del("2.0.0.0/8")

Got failure:

27:17.657861 r-tigon-21 NOTICE swss#orchagent: :- addNeighbor: Created neighbor ip 2.0.6.42, 00:11:00:00:00:22 on PortChannel101
2024 Dec 19 18:27:17.663114 r-tigon-21 NOTICE swss#orchagent: :- addNextHop: Created next hop 2.0.6.42 on PortChannel101
2024 Dec 19 18:27:17.668304 r-tigon-21 NOTICE swss#orchagent: :- remove_route: Removed tunnel route to 2.0.6.42/32
2024 Dec 19 18:27:17.671257 r-tigon-21 NOTICE swss#orchagent: :- addNextHopGroup: Create next hop group 2.0.0.2@PortChannel101,2.0.6.42@PortChannel101
2024 Dec 19 18:27:17.674968 r-tigon-21 ERR swss#orchagent: :- meta_sai_validate_route_entry: object key SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"2.0.6.42/32","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000002"} doesn't exist
2024 Dec 19 18:27:17.675064 r-tigon-21 ERR swss#orchagent: :- flush_setting_entries: EntityBulker.flush set entry attribute failed, number of entries to set: 1, status: SAI_STATUS_ITEM_NOT_FOUND
2024 Dec 19 18:27:17.675101 r-tigon-21 ERR swss#orchagent: :- addRoutePost: Failed to set route 2.0.6.42/32 with next hop(s) 2.0.0.2@PortChannel101,2.0.6.42@PortChannel101
2024 Dec 19 18:27:17.675123 r-tigon-21 ERR swss#orchagent: :- handleSaiSetStatus: Encountered failure in set operation, exiting orchagent, SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED
2024 Dec 19 18:27:17.675151 r-tigon-21 NOTICE swss#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Dec 19 18:27:30.996722 r-tigon-21 NOTICE swss#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
2024 Dec 19 18:27:32.014192 r-tigon-21 INFO swss#supervisord 2024-12-19 18:27:32,012 WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)
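
When triaging runs like the ones above, it can help to pull out just the line where orchagent decides to exit. The following is a minimal sketch, assuming the message format matches the `handleSai...Status` lines quoted in this issue; the function name `find_fatal_sai_failures` is hypothetical and is not part of sonic-mgmt or sonic-swss.

```python
import re

# Pattern based only on the "exiting orchagent" messages quoted in this issue.
FATAL_RE = re.compile(
    r"handleSai\w+Status: Encountered failure in (?P<verb>\w+) operation, "
    r"exiting orchagent, SAI API: (?P<api>\w+), status: (?P<status>\w+)"
)


def find_fatal_sai_failures(lines):
    """Return (operation, SAI API, status) for each fatal orchagent line."""
    hits = []
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            hits.append((match.group("verb"), match.group("api"), match.group("status")))
    return hits


if __name__ == "__main__":
    sample = [
        "2024 Dec 19 18:27:17.675123 r-tigon-21 ERR swss#orchagent: :- handleSaiSetStatus: "
        "Encountered failure in set operation, exiting orchagent, "
        "SAI API: SAI_API_ROUTE, status: SAI_STATUS_NOT_EXECUTED",
        "2024 Dec 19 18:27:17.675151 r-tigon-21 NOTICE swss#orchagent: :- notifySyncd: "
        "sending syncd: SYNCD_INVOKE_DUMP",
    ]
    for verb, api, status in find_fatal_sai_failures(sample):
        print(f"{verb} failed in {api}: {status}")
```

The same function can be fed the lines of a saved syslog file to list every fatal SAI failure in a test run.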

Steps to reproduce the issue:

  1. Run test:
python3 -m pytest crm/test_crm.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern mtvr-tigon-02,mtvr-tigon-04 --module-path ../ansible/library/ --testbed sonic-dual-tor-tigon-01-dualtor-aa --testbed_file ../ansible/testbed.yaml --allow_recover  --assert plain --log-cli-level debug --show-capture=no -ra --showlocals -v --clean-alluredir --alluredir=/tmp/allure-results --timeout 6000 --session-timeout 6000 --allure_server_addr="allure.nvidia.com" --allure_server_port= --topology dualtor,any,util,t0 --skip_sanity --dynamic_update_skip_reason --random_seed=1734183076 --store_la_logs --ignore_la_failure

Describe the results you received:

Orchagent creates a tunnel route when it receives a neighbor with an all-zero MAC on a PortChannel, which causes various issues when the same prefix is later installed via RouteOrch.
Orchagent crashes on an attempt to remove a next hop group (NHG) that is still referenced by a route.
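
The first crash can be illustrated with a toy reference-counting model. This is only a sketch of one reading of the log (the route removal fails, so the NHG keeps a reference count of 1, and the later NHG removal is rejected). The class and method names are hypothetical and do not correspond to orchagent or SAI code.

```python
class ObjectInUse(Exception):
    """Raised when an object is removed while something still references it."""


class ToyObjectStore:
    # Hypothetical model: tracks how many routes point at each next hop group.
    def __init__(self):
        self.refcount = {}  # NHG id -> number of routes referencing it
        self.routes = {}    # route prefix -> NHG id

    def create_nhg(self, nhg_id):
        self.refcount[nhg_id] = 0

    def set_route(self, prefix, nhg_id):
        old = self.routes.get(prefix)
        if old is not None:
            self.refcount[old] -= 1
        self.routes[prefix] = nhg_id
        self.refcount[nhg_id] += 1

    def remove_route(self, prefix):
        nhg_id = self.routes.pop(prefix)
        self.refcount[nhg_id] -= 1

    def remove_nhg(self, nhg_id):
        count = self.refcount[nhg_id]
        if count != 0:
            raise ObjectInUse(f"{nhg_id} reference count is {count}, can't remove")
        del self.refcount[nhg_id]


if __name__ == "__main__":
    store = ToyObjectStore()
    store.create_nhg("nhg-1")
    store.set_route("2.0.4.99/32", "nhg-1")
    # Simulate the failed route removal: the route is never actually removed,
    # but the caller goes on to delete the group anyway.
    try:
        store.remove_nhg("nhg-1")
    except ObjectInUse as exc:
        print("remove rejected:", exc)
```

In this model the removal only succeeds once `remove_route` has run for every prefix that points at the group.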

Describe the results you expected:

No orchagent crash, and no tunnel route created for a neighbor on a PortChannel.

Output of show version:


SONiC Software Version: SONiC.202405_RC.60-4952d6aea_Internal
SONiC OS Version: 12
Distribution: Debian 12.8
Kernel: 6.1.0-22-2-amd64
Build commit: 51824d390
Build date: Wed Dec 18 09:24:32 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

