
Commit bde6d78

Daniel Borkmann authored and gregkh committed
Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
[ Upstream commit 362d520 ]

This reverts commit ef2820a ("net: sctp: Fix a_rwnd/rwnd management to
reflect real state of the receiver's buffer") as it introduced a serious
performance regression on SCTP over IPv4 and IPv6, though not as dramatic
on the latter. Measurements are on 10Gbit/s with ixgbe NICs.

Current state:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 17:56:21 GMT
Connecting to host 192.168.241.3, port 5201
Cookie: Lab200slot2.1397238981.812898.548918
[  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
[  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
[  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
[  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
[  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
[  4]   6.21-6.21   sec  0.00 Bytes   0.00 bits/sec
[  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
[  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
[  4]  11.45-11.45  sec  0.00 Bytes   0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes   0.00 bits/sec
[  4]  11.45-11.45  sec  0.00 Bytes   0.00 bits/sec
[  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
[  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
[  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
[  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
[  4]  16.79-16.79  sec  0.00 Bytes   0.00 bits/sec
[  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
(etc)

[root@Lab200slot2 ~]# iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
Time: Fri, 11 Apr 2014 19:08:41 GMT
Connecting to host 2001:db8:0:f101::1, port 5201
Cookie: Lab200slot2.1397243321.714295.2b3f7c
[  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
[  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
[  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
[  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
[  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
[  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
[  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
[  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
[  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
[  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
[  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
[  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
[  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
(etc)

After patch:

[root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
iperf version 3.0.1 (10 January 2014)
Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
Time: Mon, 14 Apr 2014 16:40:48 GMT
Connecting to host 192.168.240.3, port 5201
Cookie: Lab200slot2.1397493648.413274.65e131
[  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
[  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
[  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
[  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
[  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
[  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec

With the reverted patch applied, SCTP performance is back to normal on
latest upstream for both IPv4 and IPv6 and matches the throughput of the
3.4.2 test kernel; the rate is steady and the interval reports are smooth
again.

Fixes: ef2820a ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
Reported-by: Peter Butler <[email protected]>
Reported-by: Dongsheng Song <[email protected]>
Reported-by: Fengguang Wu <[email protected]>
Tested-by: Peter Butler <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Matija Glavinic Pecotic <[email protected]>
Cc: Alexander Sverdlin <[email protected]>
Cc: Vlad Yasevich <[email protected]>
Acked-by: Vlad Yasevich <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
1 parent 72beb56 · commit bde6d78

5 files changed (+87, -25)

include/net/sctp/structs.h (+13, -1)

@@ -1653,6 +1653,17 @@ struct sctp_association {
 	/* This is the last advertised value of rwnd over a SACK chunk. */
 	__u32 a_rwnd;
 
+	/* Number of bytes by which the rwnd has slopped. The rwnd is allowed
+	 * to slop over a maximum of the association's frag_point.
+	 */
+	__u32 rwnd_over;
+
+	/* Keeps treack of rwnd pressure. This happens when we have
+	 * a window, but not recevie buffer (i.e small packets). This one
+	 * is releases slowly (1 PMTU at a time ).
+	 */
+	__u32 rwnd_press;
+
 	/* This is the sndbuf size in use for the association.
 	 * This corresponds to the sndbuf size for the association,
 	 * as specified in the sk->sndbuf.
@@ -1881,7 +1892,8 @@ void sctp_assoc_update(struct sctp_association *old,
 __u32 sctp_association_get_next_tsn(struct sctp_association *);
 
 void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
-void sctp_assoc_rwnd_update(struct sctp_association *, bool);
+void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
+void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
 void sctp_assoc_set_primary(struct sctp_association *,
 			    struct sctp_transport *);
 void sctp_assoc_del_nonprimary_peers(struct sctp_association *,

net/sctp/associola.c

+65-17
Original file line numberDiff line numberDiff line change
@@ -1396,35 +1396,44 @@ static inline bool sctp_peer_needs_update(struct sctp_association *asoc)
13961396
return false;
13971397
}
13981398

1399-
/* Update asoc's rwnd for the approximated state in the buffer,
1400-
* and check whether SACK needs to be sent.
1401-
*/
1402-
void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
1399+
/* Increase asoc's rwnd by len and send any window update SACK if needed. */
1400+
void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
14031401
{
1404-
int rx_count;
14051402
struct sctp_chunk *sack;
14061403
struct timer_list *timer;
14071404

1408-
if (asoc->ep->rcvbuf_policy)
1409-
rx_count = atomic_read(&asoc->rmem_alloc);
1410-
else
1411-
rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
1405+
if (asoc->rwnd_over) {
1406+
if (asoc->rwnd_over >= len) {
1407+
asoc->rwnd_over -= len;
1408+
} else {
1409+
asoc->rwnd += (len - asoc->rwnd_over);
1410+
asoc->rwnd_over = 0;
1411+
}
1412+
} else {
1413+
asoc->rwnd += len;
1414+
}
14121415

1413-
if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
1414-
asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
1415-
else
1416-
asoc->rwnd = 0;
1416+
/* If we had window pressure, start recovering it
1417+
* once our rwnd had reached the accumulated pressure
1418+
* threshold. The idea is to recover slowly, but up
1419+
* to the initial advertised window.
1420+
*/
1421+
if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
1422+
int change = min(asoc->pathmtu, asoc->rwnd_press);
1423+
asoc->rwnd += change;
1424+
asoc->rwnd_press -= change;
1425+
}
14171426

1418-
pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
1419-
__func__, asoc, asoc->rwnd, rx_count,
1420-
asoc->base.sk->sk_rcvbuf);
1427+
pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
1428+
__func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
1429+
asoc->a_rwnd);
14211430

14221431
/* Send a window update SACK if the rwnd has increased by at least the
14231432
* minimum of the association's PMTU and half of the receive buffer.
14241433
* The algorithm used is similar to the one described in
14251434
* Section 4.2.3.3 of RFC 1122.
14261435
*/
1427-
if (update_peer && sctp_peer_needs_update(asoc)) {
1436+
if (sctp_peer_needs_update(asoc)) {
14281437
asoc->a_rwnd = asoc->rwnd;
14291438

14301439
pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
@@ -1446,6 +1455,45 @@ void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool update_peer)
14461455
}
14471456
}
14481457

1458+
/* Decrease asoc's rwnd by len. */
1459+
void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
1460+
{
1461+
int rx_count;
1462+
int over = 0;
1463+
1464+
if (unlikely(!asoc->rwnd || asoc->rwnd_over))
1465+
pr_debug("%s: association:%p has asoc->rwnd:%u, "
1466+
"asoc->rwnd_over:%u!\n", __func__, asoc,
1467+
asoc->rwnd, asoc->rwnd_over);
1468+
1469+
if (asoc->ep->rcvbuf_policy)
1470+
rx_count = atomic_read(&asoc->rmem_alloc);
1471+
else
1472+
rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
1473+
1474+
/* If we've reached or overflowed our receive buffer, announce
1475+
* a 0 rwnd if rwnd would still be positive. Store the
1476+
* the potential pressure overflow so that the window can be restored
1477+
* back to original value.
1478+
*/
1479+
if (rx_count >= asoc->base.sk->sk_rcvbuf)
1480+
over = 1;
1481+
1482+
if (asoc->rwnd >= len) {
1483+
asoc->rwnd -= len;
1484+
if (over) {
1485+
asoc->rwnd_press += asoc->rwnd;
1486+
asoc->rwnd = 0;
1487+
}
1488+
} else {
1489+
asoc->rwnd_over = len - asoc->rwnd;
1490+
asoc->rwnd = 0;
1491+
}
1492+
1493+
pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
1494+
__func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
1495+
asoc->rwnd_press);
1496+
}
14491497

14501498
/* Build the bind address list for the association based on info from the
14511499
* local endpoint and the remote peer.
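
For readers following the revert, the restored bookkeeping in the two functions above can be exercised outside the kernel. The sketch below is a minimal userspace model, not kernel code: the struct rwnd_state wrapper, the rwnd_increase/rwnd_decrease/min_u names, the rcvbuf_full parameter (standing in for the rx_count vs. sk_rcvbuf check), and the example sizes are all hypothetical, chosen only to illustrate how rwnd, rwnd_over and rwnd_press interact.

/* Minimal userspace model of the restored rwnd accounting (illustration
 * only). Build with: cc -o rwnd_model rwnd_model.c
 */
#include <stdio.h>

struct rwnd_state {
	unsigned int rwnd;       /* currently advertisable window */
	unsigned int rwnd_over;  /* bytes the window was overdrawn by */
	unsigned int rwnd_press; /* window withheld under buffer pressure */
	unsigned int pathmtu;    /* recovery step for rwnd_press */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Loosely mirrors sctp_assoc_rwnd_decrease(): charge len received bytes. */
static void rwnd_decrease(struct rwnd_state *s, unsigned int len, int rcvbuf_full)
{
	if (s->rwnd >= len) {
		s->rwnd -= len;
		if (rcvbuf_full) {
			/* Window left but no receive buffer: park the
			 * remainder as pressure and advertise 0.
			 */
			s->rwnd_press += s->rwnd;
			s->rwnd = 0;
		}
	} else {
		/* Data exceeded the window: remember the slop. */
		s->rwnd_over = len - s->rwnd;
		s->rwnd = 0;
	}
}

/* Loosely mirrors sctp_assoc_rwnd_increase(): credit len bytes read. */
static void rwnd_increase(struct rwnd_state *s, unsigned int len)
{
	if (s->rwnd_over) {
		if (s->rwnd_over >= len) {
			s->rwnd_over -= len;
		} else {
			s->rwnd += len - s->rwnd_over;
			s->rwnd_over = 0;
		}
	} else {
		s->rwnd += len;
	}

	/* Recover parked pressure slowly, one PMTU at a time. */
	if (s->rwnd_press && s->rwnd >= s->rwnd_press) {
		unsigned int change = min_u(s->pathmtu, s->rwnd_press);

		s->rwnd += change;
		s->rwnd_press -= change;
	}
}

int main(void)
{
	struct rwnd_state s = { .rwnd = 4096, .pathmtu = 1500 };

	rwnd_decrease(&s, 3000, 0); /* normal charge */
	rwnd_decrease(&s, 2000, 0); /* overdraws the window by 904 bytes */
	rwnd_increase(&s, 3000);    /* user read: pays back the slop first */
	printf("rwnd=%u over=%u press=%u\n", s.rwnd, s.rwnd_over, s.rwnd_press);
	return 0;
}

Running the model prints rwnd=2096 over=0 press=0: the 904-byte overdraft from the second chunk is repaid out of the user's read before the window is re-opened, which is the behaviour the revert restores.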

net/sctp/sm_statefuns.c

+1-1
Original file line numberDiff line numberDiff line change
@@ -6178,7 +6178,7 @@ static int sctp_eat_data(const struct sctp_association *asoc,
61786178
* PMTU. In cases, such as loopback, this might be a rather
61796179
* large spill over.
61806180
*/
6181-
if ((!chunk->data_accepted) && (!asoc->rwnd ||
6181+
if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
61826182
(datalen > asoc->rwnd + asoc->frag_point))) {
61836183

61846184
/* If this is the next TSN, consider reneging to make
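
The one-line change above widens the check in sctp_eat_data() so that the branch which considers reneging or dropping a chunk is also entered while the window is overdrawn. A standalone restatement of that condition is sketched below; the helper name and the userspace harness are hypothetical, not kernel API, and only mirror the expression in the hunk.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical restatement of the sctp_eat_data() check after this patch:
 * a chunk that has not already been accepted is refused when the window
 * is closed, the window is currently overdrawn (rwnd_over), or the chunk
 * exceeds the window by more than the fragmentation point.
 */
static bool sctp_would_refuse_data(bool data_accepted, unsigned int rwnd,
				   unsigned int rwnd_over, unsigned int datalen,
				   unsigned int frag_point)
{
	return !data_accepted &&
	       (rwnd == 0 || rwnd_over != 0 || datalen > rwnd + frag_point);
}

int main(void)
{
	/* A 1452-byte chunk against a healthy 4 KB window passes the check... */
	printf("%d\n", sctp_would_refuse_data(false, 4096, 0, 1452, 1452));
	/* ...but is flagged while the window is overdrawn by any amount. */
	printf("%d\n", sctp_would_refuse_data(false, 0, 904, 1452, 1452));
	return 0;
}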

net/sctp/socket.c

+6
Original file line numberDiff line numberDiff line change
@@ -2115,6 +2115,12 @@ static int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
21152115
sctp_skb_pull(skb, copied);
21162116
skb_queue_head(&sk->sk_receive_queue, skb);
21172117

2118+
/* When only partial message is copied to the user, increase
2119+
* rwnd by that amount. If all the data in the skb is read,
2120+
* rwnd is updated when the event is freed.
2121+
*/
2122+
if (!sctp_ulpevent_is_notification(event))
2123+
sctp_assoc_rwnd_increase(event->asoc, copied);
21182124
goto out;
21192125
} else if ((event->msg_flags & MSG_NOTIFICATION) ||
21202126
(event->msg_flags & MSG_EOR))

net/sctp/ulpevent.c

+2-6
Original file line numberDiff line numberDiff line change
@@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
989989
skb = sctp_event2skb(event);
990990
/* Set the owner and charge rwnd for bytes received. */
991991
sctp_ulpevent_set_owner(event, asoc);
992-
sctp_assoc_rwnd_update(asoc, false);
992+
sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
993993

994994
if (!skb->data_len)
995995
return;
@@ -1011,7 +1011,6 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
10111011
{
10121012
struct sk_buff *skb, *frag;
10131013
unsigned int len;
1014-
struct sctp_association *asoc;
10151014

10161015
/* Current stack structures assume that the rcv buffer is
10171016
* per socket. For UDP style sockets this is not true as
@@ -1036,11 +1035,8 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
10361035
}
10371036

10381037
done:
1039-
asoc = event->asoc;
1040-
sctp_association_hold(asoc);
1038+
sctp_assoc_rwnd_increase(event->asoc, len);
10411039
sctp_ulpevent_release_owner(event);
1042-
sctp_assoc_rwnd_update(asoc, true);
1043-
sctp_association_put(asoc);
10441040
}
10451041

10461042
static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
