[chassis] route_check fails on LC due to timeout on frr routes #18773

Open
anamehra opened this issue Apr 23, 2024 · 12 comments · May be fixed by sonic-net/sonic-utilities#3544
Assignees
Labels
Chassis 🤖 Modular chassis support Triaged this issue has been triaged

Comments

@anamehra
Contributor

Description

On chassis, after the introduction of the FRR route check in route_check.py (sonic-net/sonic-utilities#2762), route_check.py may take more than 2 minutes to finish. The current timeout is 2 minutes (120 seconds), so the route check is aborted and reported as failed, which affects monit output. This in turn breaks the sonic-mgmt pretest check, and other test cases that rely on monit output may also be affected.

root@sfd-t2-lc0:/home/cisco# time route_check.py
Aborting routeCheck.py upon timeout signal after 120 seconds
[<FrameSummary file /usr/local/bin/route_check.py, line 810 in <module>>, <FrameSummary file /usr/local/bin/route_check.py, line 797 in main>, <FrameSummary file /usr/local/bin/route_check.py, line 745 in check_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 537 in check_frr_pending_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 345 in get_frr_routes>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 424 in check_output>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 507 in run>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 1121 in communicate>, <FrameSummary file /usr/local/bin/route_check.py, line 95 in handler>]
Traceback (most recent call last):
  File "/usr/local/bin/route_check.py", line 810, in <module>
    sys.exit(main()[0])
  File "/usr/local/bin/route_check.py", line 797, in main
    ret, res= check_routes()
  File "/usr/local/bin/route_check.py", line 745, in check_routes
    rt_frr_miss = check_frr_pending_routes()
  File "/usr/local/bin/route_check.py", line 537, in check_frr_pending_routes
    frr_routes = get_frr_routes()
  File "/usr/local/bin/route_check.py", line 345, in get_frr_routes
    output = subprocess.check_output('show ipv6 route json', shell=True)
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 507, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.9/subprocess.py", line 1121, in communicate
    stdout = self.stdout.read()
  File "/usr/local/bin/route_check.py", line 96, in handler
    raise Exception("timeout occurred")
Exception: timeout occurred

real    2m0.714s
user    0m57.700s
sys     0m2.939s
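
The traceback shows the check being interrupted inside `subprocess.check_output` by a signal handler that raises once the 120-second limit has passed. Below is a minimal Python sketch of that pattern, assuming an alarm-based timer. The names `run_with_timeout` and `RouteCheckTimeout` are hypothetical and are not taken from route_check.py; the real script may arm its timer differently.

```python
import signal
import time


class RouteCheckTimeout(Exception):
    """Hypothetical exception type used only in this sketch."""


def run_with_timeout(func, seconds):
    """Run func(); raise RouteCheckTimeout if it is still running after `seconds`.

    Assumes a POSIX system and that this is called from the main thread,
    since SIGALRM handlers can only be installed there.
    """
    def handler(signum, frame):
        # Raising here unwinds whatever blocking call is in progress,
        # for example a read() on a subprocess pipe.
        raise RouteCheckTimeout(f"timeout occurred after {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        # Always disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

With this shape, a slow FRR query does not hang forever, but any run that legitimately needs longer than the limit is reported as a failure, which is the behaviour described in this issue.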

The same issue was opened earlier for 202305 and was fixed there by reverting the FRR route check feature: #17403

This needs to be fixed for master.


@bingwang-ms bingwang-ms added the Triaged this issue has been triaged label Apr 24, 2024
@bingwang-ms
Contributor

The issue will be triaged further in the chassis meeting

@judyjoseph judyjoseph added the Chassis 🤖 Modular chassis support label Apr 24, 2024
@abdosi
Contributor

abdosi commented Apr 26, 2024

@stephenxs @stepanblyschak @liat-grozovik: can you please help with this?

@abdosi
Contributor

abdosi commented Apr 26, 2024

@judyjoseph @arlakshm @mlok-nokia @ysmanman for visibility. This will apply to the master image as well.

@arlakshm
Contributor

arlakshm commented May 8, 2024

The 'Install before advt.' feature might be disabled for 202405.

@stepanblyschak
Collaborator

@anamehra Could you please share a tech support when the issue occurs? What is the route scale on the system?
If you have an opportunity to play with the system, could you please increase the timeout to 1h and check whether route_check.py eventually finishes or is stuck without progress?

@anamehra
Contributor Author

> @anamehra Could you please share a tech support when the issue occurs? What is the route scale on the system? If you have an opportunity to play with the system, could you please increase the timeout to 1h and check whether route_check.py eventually finishes or is stuck without progress?

route_check.py eventually finished; it took a couple more minutes. We have 50K routes. I will check on show tech.

@rlhui
Contributor

rlhui commented Jun 12, 2024

This is still an issue with 202405.

@rlhui rlhui assigned deepak-singhal0408 and unassigned rlhui Jul 19, 2024
@mannytaheri

@deepak-singhal0408 - I have attached logs for the routeCheck issue.
routeCheck_logs.txt

@deepak-singhal0408
Contributor

deepak-singhal0408 commented Sep 10, 2024

This feature has been re-enabled in master: #19836

@deepak-singhal0408
Contributor

deepak-singhal0408 commented Sep 11, 2024

Tried 2 iterations on a device with 32k v4 + 32k v6 routes.

Neighbor    V  AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State/PfxRcd  NeighborName
10.0.0.1    4  65200  61249    14398    0       0    0     01:02:57  1             ARISTA01T3
10.0.0.5    4  65200  0        0        0       0    0     never     Active        ARISTA03T3
10.0.0.7    4  65200  6059     5857     0       0    0     4d00h24m  1             ARISTA04T3
10.0.0.11   4  65200  6056     5856     0       0    0     4d00h24m  33793         ARISTA06T3

Iteration1: <<<<<<<<<<<<<<<<<
Checking routes for namespaces: ['asic0', 'asic1']

real 3m16.387s
user 1m26.084s
sys 0m7.275s

Iteration2: <<<<<<<<<<<<<<<<<<<<<<<<<
Checking routes for namespaces: ['asic0', 'asic1']

real 3m18.249s
user 1m26.760s
sys 0m7.926s

@deepak-singhal0408
Contributor

deepak-singhal0408 commented Sep 11, 2024

python -m cProfile -s time route_check.py
122726378 function calls (82385912 primitive calls) in 216.529 seconds

Ordered by: internal time

           ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
                6   90.089   15.015   90.089   15.015  {built-in method time.sleep}
               14   82.537    5.896   82.653    5.904  {method 'read' of '_io.TextIOWrapper' objects}
51279296/15794766   10.061    0.000   15.341    0.000  encoder.py:333(_iterencode_dict)
                2    6.252    3.126    6.252    3.126  {built-in method swsscommon._swsscommon.new_SubscriberStateTable}
               12    4.621    0.385    4.621    0.385  decoder.py:343(raw_decode)
20647482/15794694    3.588    0.000   10.100    0.000  encoder.py:277(_iterencode_list)
              106    2.978    0.028    2.978    0.028  {method 'format' of 'str' objects}
         15794766    2.714    0.000   18.055    0.000  encoder.py:413(_iterencode)
                9    1.360    0.151   19.613    2.179  encoder.py:182(encode)
         12982522    1.148    0.000    1.148    0.000  {built-in method builtins.isinstance}
           205278    0.854    0.000    1.632    0.000  ipaddress.py:1603(_ip_int_from_string)
          4736970    0.720    0.000    0.720    0.000  {built-in method _json.encode_basestring_ascii}
           821056    0.655    0.000    0.891    0.000  ipaddress.py:1201(_parse_octet)
           410527    0.453    0.000    2.253    0.000  ipaddress.py:1269(__init__)
           410514    0.446    0.000    1.700    0.000  ipaddress.py:1175(_ip_int_from_string)
           615687    0.381    0.000    0.666    0.000  ipaddress.py:1707(_parse_hextet)
                2    0.374    0.187  180.955   90.478  route_check.py:520(check_frr_pending_routes) <<<<<<<<<<<<<<<<<
           205295    0.316    0.000    2.137    0.000  ipaddress.py:1875(__init__)
           139211    0.288    0.000    0.289    0.000  {method 'join' of 'str' objects}
           273646    0.288    0.000    3.931    0.000  route_check.py:165(is_local)
          1231834    0.285    0.000    0.285    0.000  {method 'split' of 'str' objects}
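
In the profile above, `time.sleep` (90.089 s) and the pipe `read` (82.537 s) together account for 172.626 s, roughly 80% of the 216.529 s total, so most of the run is spent waiting rather than computing. The same kind of report can be produced from inside Python with the standard `cProfile` and `pstats` modules. The sketch below is illustrative only: `slow_fetch` is a hypothetical stand-in for a blocking call such as the FRR route query, not code from route_check.py.

```python
import cProfile
import io
import pstats
import time


def slow_fetch():
    # Hypothetical stand-in for a blocking call (e.g. waiting on a subprocess).
    time.sleep(0.05)
    return list(range(1000))


def profile_top(func, limit=5):
    """Profile func() and return a text report sorted by internal time.

    SortKey.TIME is the same ordering that `-s time` selects on the
    cProfile command line.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.TIME).print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_top(slow_fetch))
```

Wrapping only `check_frr_pending_routes` this way, instead of the whole script, would be one way to narrow down where its 180.955 s of cumulative time goes.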

@deepak-singhal0408
Contributor

With the following optimizations, route_check time is reduced to about 1m30s:

  1. Parallel execution for each ASIC namespace
  2. Parallel fetching of routes for v4 and v6

time route_check.py
real 1m30.675s
user 1m33.777s
sys 0m8.209s
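
The two optimizations listed above can be sketched with a thread pool that issues one route query per (namespace, address family) pair. This is a hedged illustration, not the change proposed in the linked pull request: `fetch_routes_parallel`, `run_json_command` and the `build_command` callback are hypothetical names, and the exact command used to query FRR per namespace is left to the caller because the thread does not show it.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_json_command(command):
    """Run a shell command and parse its stdout as JSON."""
    output = subprocess.check_output(command, shell=True)
    return json.loads(output)


def fetch_routes_parallel(namespaces, build_command, run=run_json_command):
    """Fetch v4 and v6 routes for every namespace concurrently.

    build_command(namespace, family) must return the command string for one
    query; family is "ip" or "ipv6". Returns {namespace: {family: routes}}.
    """
    jobs = [(ns, family) for ns in namespaces for family in ("ip", "ipv6")]
    results = {}
    # Threads are enough here if the time is dominated by waiting on the
    # subprocess pipe; CPU-bound JSON decoding will not speed up this way.
    with ThreadPoolExecutor(max_workers=len(jobs) or 1) as pool:
        futures = {job: pool.submit(run, build_command(*job)) for job in jobs}
        for (ns, family), future in futures.items():
            # result() re-raises any exception from the worker thread.
            results.setdefault(ns, {})[family] = future.result()
    return results
```

A caller would pass the namespace list (for example `['asic0', 'asic1']`, as in the runs above) and a function that builds the appropriate route query for each namespace and family. Since user time stays around 1m33s in the optimized run, the remaining cost is probably CPU work such as JSON handling, which threads alone would not reduce.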


9 participants