[Bug]: qos/test_qos_sai.py::TestQosSai::testQosSaiPgSharedWatermark fails with multi_asic and multi_dut variants #16167

Open
arista-nwolfe opened this issue Dec 19, 2024 · 1 comment · May be fixed by #16169

Comments

@arista-nwolfe
Contributor

Issue Description

Failure seen:

dst_port_id: 47, src_port_id: 34 src_port_vlan: None
actual dst_port_id: 47
Initial watermark:[112, 0, 0, 0, 256, 0, 0, 0]
Received packets: 0
Init pkts num sent: 0, min: 0, actual watermark value to start: 0
Filled PG min
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
|               | Pfc3TxPkt | InDiscard | InDropPkt | OutDiscard | OutDropPkt | OutUcPkt | InUcPkt | InNonUcPkt | OutNonUcPkt | OutQlen | Ing Pg3 Pkt | Ing Pg3 Share Wm |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
| base src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2442    |     1248    |    0    |    837638   |        0         |
|      src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2443    |     1249    |    0    |    837638   |        0         |
| base dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
|      dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
pkts num to send: 41, total pkts: 41, pg shared: 415271
Compensate 2176538 packets to port 34, and retry 1 times
Received packets: 418930
To fill PG share pool, send 41 pkt
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
|               | Pfc3TxPkt | InDiscard | InDropPkt | OutDiscard | OutDropPkt | OutUcPkt | InUcPkt | InNonUcPkt | OutNonUcPkt | OutQlen | Ing Pg3 Pkt | Ing Pg3 Share Wm |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
| base src port |  3159299  |  2821770  |     0     |     0      |     0      |    4     | 5839385 |    2442    |     1248    |    0    |    837638   |        0         |
|      src port |  5555195  |  4576881  |     0     |     0      |     0      |    4     | 8013426 |    2493    |     1274    |    0    |   1256568   |     46510464     |
| base dst port |     0     |     3     |     0     |     0      |     0      |  422261  |   5177  |    1224    |      88     |    0    |      0      |        0         |
|      dst port |     0     |     3     |     0     |     0      |     0      |  422408  |   5227  |    1250    |      88     |    0    |      0      |        0         |
+---------------+-----------+-----------+-----------+------------+------------+----------+---------+------------+-------------+---------+-------------+------------------+
lower bound: 167936, actual value: 46510464, upper bound (+40): 9072
> /root/saitests/py3/sai_qos_tests.py(4624)runTest()
======================================================================
FAIL: sai_qos_tests.PGSharedWatermarkTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "saitests/py3/sai_qos_tests.py", line 4623, in runTest
    * (packet_length + internal_hdr_size)))
AssertionError

----------------------------------------------------------------------
Ran 1 test in 934.434s

The issue appears to occur in dynamically_compensate_leakout:
Compensate 2176538 packets to port 34, and retry 1 times
This sends far too many packets (2176538).

This function compares the TX_OK value before and after sending the 41 packets.
Here is where it stores the counts before the packets are sent:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L4464

             xmit_counters_history, _ = sai_thrift_read_port_counters(
                 self.dst_client, asic_type, port_list['dst'][dst_port_id])

Within dynamically_compensate_leakout, this is where the counters are read again:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L454

    curr, _ = counter_checker(thrift_client, asic_type, check_port)
    leakout_num = curr[check_field] - prev[check_field]

The problem is that the call to dynamically_compensate_leakout passes self.src_client as the thrift_client argument, while the port it checks belongs to self.dst_client:
https://github.com/sonic-net/sonic-mgmt/blob/202405/tests/saitests/py3/sai_qos_tests.py#L4551

                    dynamically_compensate_leakout(self.src_client, asic_type, sai_thrift_read_port_counters,
                                                   port_list['dst'][dst_port_id], TRANSMITTED_PKTS,
                                                   xmit_counters_history, self, src_port_id, pkt, 40)

In this failure case I can see that the same port ID value exists on both asics; on the src asic it belongs to port 32:

(Pdb) port_list['src'][32]
4294967297
(Pdb) port_list['dst'][dst_port_id]
4294967297

If I dump the TX_OK of port 32 on the src asic I get:

(Pdb) sai_thrift_read_port_counters(self.src_client, asic_type, port_list['dst'][dst_port_id])[0][TRANSMITTED_PKTS]
2598800

On the dst asic I get:

(Pdb) sai_thrift_read_port_counters(self.dst_client, asic_type, port_list['dst'][dst_port_id])[0][TRANSMITTED_PKTS]
422409

This is where the massive compensate packet number comes from:

(Pdb) xmit_counters_history[TRANSMITTED_PKTS]
422261
2598800 - 422261 = 2176539
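
The arithmetic above can be reproduced from the values in the pdb session. This is only a sketch: `leakout` is a hypothetical helper standing in for the subtraction quoted from dynamically_compensate_leakout, and the constants are the counter values quoted in this issue.

```python
# Counter values copied from the pdb session in this issue.
PREV_TX_OK = 422261        # xmit_counters_history[TRANSMITTED_PKTS], read via dst_client
SRC_ASIC_TX_OK = 2598800   # same port ID read via src_client (wrong asic)
DST_ASIC_TX_OK = 422409    # same port ID read via dst_client (matching asic)


def leakout(curr, prev):
    """Hypothetical stand-in for `curr[check_field] - prev[check_field]`."""
    return curr - prev


wrong = leakout(SRC_ASIC_TX_OK, PREV_TX_OK)
right = leakout(DST_ASIC_TX_OK, PREV_TX_OK)
print(wrong)  # 2176539
print(right)  # 148
```

With the baseline and the current reading both taken from the dst asic, the delta is 148 packets instead of roughly 2.18 million.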

Results you see

dynamically_compensate_leakout polls the incorrect asic/dut client.

Results you expected to see

dynamically_compensate_leakout should poll the client of the same asic/dut that owns the port we're referencing.
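
A minimal sketch of the expected behavior, assuming the fix is simply to poll through the client that produced the baseline. Everything here is illustrative: `FakeAsicClient`, `fake_read_port_counters`, and `leakout_delta` are hypothetical stand-ins (not the sonic-mgmt code), and `TRANSMITTED_PKTS = 0` is a placeholder index chosen for this sketch only.

```python
TRANSMITTED_PKTS = 0  # placeholder index for this sketch, not the real constant


class FakeAsicClient:
    """Hypothetical per-asic client holding TX_OK counters keyed by port ID."""

    def __init__(self, tx_ok_by_port):
        self.tx_ok_by_port = tx_ok_by_port


def fake_read_port_counters(client, asic_type, port):
    """Mimics the two-value return shape of the counter reader quoted above."""
    return [client.tx_ok_by_port[port]], None


def leakout_delta(thrift_client, asic_type, counter_checker,
                  check_port, check_field, prev):
    """Simplified stand-in for the counter comparison quoted from the test."""
    curr, _ = counter_checker(thrift_client, asic_type, check_port)
    return curr[check_field] - prev[check_field]


PORT_ID = 4294967297  # the same ID value is valid on both asics
src_client = FakeAsicClient({PORT_ID: 2598800})
dst_client = FakeAsicClient({PORT_ID: 422409})

# Baseline taken through dst_client before the packets are sent.
xmit_counters_history = [422261]

# Poll through the same client that produced the baseline.
delta = leakout_delta(dst_client, "asic", fake_read_port_counters,
                      PORT_ID, TRANSMITTED_PKTS, xmit_counters_history)
print(delta)  # 148
```

Because the port ID is valid on both asics, the mismatched read does not fail; it silently returns another port's counter, so the client and the port have to be kept paired by the caller.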

Is it platform specific

generic

Relevant log output

No response

Output of show version

No response

Attach files (if any)

No response

@arista-nwolfe
Contributor Author

@arlakshm @vmittal-msft
