
Increase Orch CPU utilization timeout before link flap #16187

Merged

Conversation

arista-hpandya
Contributor

@arista-hpandya commented Dec 20, 2024

This change was made because, in a modular chassis with multi-ASIC LCs, the link flap test might run on the uplink LC followed by the downlink LC. Since the uplink LC has many neighbors, the downlink LC's CPU is busy re-routing the affected paths after the uplink flaps. In such a scenario, the downlink LC will still be hot (above 10% CPU utilization) before we flap its interfaces. Hence the increase in timeout.

We tested with a timeout of 500 seconds and it failed, so we are increasing it to 600 seconds, which has been passing on our local T2 testbeds.

Description of PR

Summary:
Fixes #16186

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

To make sure that the timeout for the Orchagent CPU utilization check is large enough for the test to pass.

How did you do it?

Increased the timeout from 100 seconds to 600 seconds.
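
A minimal sketch of the idea behind the change, assuming a helper that samples Orchagent CPU utilization; the names below (get_orch_cpu_utilization and the constants) are placeholders for illustration, not the actual sonic-mgmt code:

import time

ORCH_CPU_THRESHOLD = 10   # percent, the threshold from the linked issue
ORCH_CPU_TIMEOUT = 600    # seconds, raised from 100 by this PR
POLL_INTERVAL = 5         # seconds between CPU samples

def wait_for_orch_cpu_to_settle(get_orch_cpu_utilization, timeout=ORCH_CPU_TIMEOUT):
    """Return True once orchagent CPU drops below the threshold, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_orch_cpu_utilization() < ORCH_CPU_THRESHOLD:
            return True
        time.sleep(POLL_INTERVAL)
    return False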

How did you verify/test it?

Ran the test on a T2 testbed with a timeout of 600 seconds (passed) and 500 seconds (failed).

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@wenyiz2021
Contributor

@arista-hpandya could you redefine the timeout in continuous link flap for T2?
Basically, leave the timeout at 100 sec for T0 and T1; we don't want to increase it for T0/T1.
For T2 we can increase it to 500 sec.

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

@arista-hpandya could you redefine the timeout in continuous link flap for T2? basically leave the timeout as 100sec for T0 and T1, we don't want to increase for t0/t1. for T2 we can increase to 500sec

Hi @wenyiz2021! Thanks for reviewing this. I have made the changes to increase the timeout only for T2 devices. Also, on a side note, happy new year!
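
For illustration only, a sketch of the topology-dependent timeout the review asked for; the tbinfo["topo"]["type"] lookup and the constant names are assumptions about the testbed-info layout, not the merged diff:

DEFAULT_ORCH_CPU_TIMEOUT = 100   # seconds, kept unchanged for T0/T1
T2_ORCH_CPU_TIMEOUT = 600        # seconds, for multi-ASIC modular chassis (T2)

def orch_cpu_timeout(tbinfo):
    """Pick a longer Orchagent CPU settle timeout on T2 topologies only."""
    topo_type = tbinfo["topo"]["type"]
    return T2_ORCH_CPU_TIMEOUT if topo_type == "t2" else DEFAULT_ORCH_CPU_TIMEOUT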

@rlhui requested a review from liamkearney-msft on January 3, 2025 04:56
Contributor

@liamkearney-msft left a comment


small comment, otherwise lgtm

@arlakshm
Contributor

arlakshm commented Jan 4, 2025

/Azp run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

/azpw run Azure.sonic-mgmt

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@arlakshm
Contributor

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

/azpw run Azure.sonic-mgmt

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

/azpw run Azure.sonic-mgmt

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

@arlakshm Looks like there is a docker install issue breaking the test pipelines. Re-running the checks from my end does not solve this issue. Can someone from MSFT verify if this is indeed a pipeline/infra issue?

Reading package lists...
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 4642 (dpkg)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

@Javier-Tan
Contributor

@arlakshm Looks like there is a docker install issue breaking the test pipelines. Re-running the checks from my end does not solve this issue. Can someone from MSFT verify if this is indeed a pipeline/infra issue?

Reading package lists...
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 4642 (dpkg)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

Hi @arista-hpandya, I've rerun the validation test cases and they seem to pass this time.

@mssonicbld
Collaborator

Cherry-pick PR to msft-202405: Azure/sonic-mgmt.msft#109

Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug][202405]: Failed: Orch CPU utilization > orch cpu threshold 10 before link flap
7 participants