Observing significant delays (>30 seconds) on "handover" to subflow #554

Closed
4 tasks done
FrankLorenz opened this issue Mar 12, 2025 · 11 comments
Labels
bug pm path-manager

Comments

@FrankLorenz

Pre-requisites

  • A similar issue has not been reported before.
  • mptcp.dev website does not cover my case.
  • An up-to-date kernel is being used.
  • This case is not fixed with the latest stable (or LTS) version listed on kernel.org

What did you do?

I am currently evaluating MPTCP for redundancy usage. At the moment, we are using the old "out of tree" MPTCP with the redundant scheduler to provide redundant network connectivity via two separate networks. Because we are now migrating to a newer kernel (6.6), we also need to migrate to the new in-kernel MPTCP.

I have a simple setup running where two of our devices are connected with two network interfaces to a switch, each interface in a different subnet. I use a simple test application where one device (client) sends a 200 byte packet every 100 ms to the other device, and the other device (server) echoes it back.
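
A similar traffic pattern could be approximated with stock tools, e.g. mptcpd's mptcpize wrapper and ncat (an untested sketch, not the actual test application used here):

# server: MPTCP-enabled echo service on an arbitrary port
mptcpize run ncat -l -e /bin/cat 40675
# client: ~200 bytes every 100 ms towards the server's primary address
mptcpize run sh -c 'while true; do head -c 200 /dev/zero; sleep 0.1; done | ncat 10.100.4.129 40675'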

server              client
---------------------------------------
10.100.4.129    <-> 10.100.4.130
192.168.42.129  <-> 192.168.42.130

The MPTCP configuration, set with ip mptcp, is:

  • server:
    add_addr_accepted 0 subflows 1
    10.100.4.129 id 1 signal dev media1
    192.168.42.129 id 2 signal dev media2

  • client:
    add_addr_accepted 2 subflows 2
    192.168.42.130 id 1 subflow fullmesh dev media2
    10.100.4.130 id 2 subflow fullmesh dev media1
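
For reference, a configuration like this would be set with commands roughly like these (a sketch; adjust to your iproute2 version):

# server
ip mptcp limits set subflow 1 add_addr_accepted 0
ip mptcp endpoint add 10.100.4.129 dev media1 id 1 signal
ip mptcp endpoint add 192.168.42.129 dev media2 id 2 signal
# client
ip mptcp limits set subflow 2 add_addr_accepted 2
ip mptcp endpoint add 192.168.42.130 dev media2 id 1 subflow fullmesh
ip mptcp endpoint add 10.100.4.130 dev media1 id 2 subflow fullmesh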

When starting the application, I can see via ip mptcp monitor on the server that the subflow is established:
[LISTENER_CREATED] saddr4=0.0.0.0 sport=40675
[ CREATED] token=03d621d1 remid=0 locid=0 saddr4=10.100.4.129 daddr4=10.100.4.130 sport=40675 dport=45436
[ ESTABLISHED] token=03d621d1 remid=0 locid=0 saddr4=10.100.4.129 daddr4=10.100.4.130 sport=40675 dport=45436
[ SF_ESTABLISHED] token=03d621d1 remid=1 locid=2 saddr4=192.168.42.129 daddr4=192.168.42.130 sport=40675 dport=41095 backup=0

What happened?

What I observe is that when I disconnect the network cable on the "primary" link on the client (10.100.4.130), it takes a random amount of time until the communication gets up and running on the subflow (192.168.42.130). Sometimes it is quite smooth, within less than a second, but I also observed cases where it took nearly 60 seconds until it worked again.

When I re-plug the cable, communication continues without any delay.

Is this expected behaviour? My assumption was that the scheduler and path manager would detect the broken connection within a short amount of time (the RTO of a normal TCP connection is around 200 ms according to the "ss" command, but I cannot find these measures for MPTCP connections), so to me these randomly long delays look like a misconfiguration or a bug.
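
For reference, the per-subflow TCP values (rto, rtt, ...) can be read from the underlying TCP subflows, and newer iproute2 versions can also list the MPTCP-level sockets; a rough sketch, assuming a reasonably recent iproute2:

# per-subflow TCP info, e.g. on the client towards the server's primary address
ss -nti dst 10.100.4.129
# MPTCP-level sockets (requires iproute2 with MPTCP support, -M/--mptcp)
ss -nM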

What did you expect to have?

My expectation would be to have a more or less seamless handover to the remaining network path in less than a second.

System info: Client

root@Riedel-NSA-004A-06-04-49:/proc/sys/net/ipv4# uname -a
Linux Riedel-NSA-004A-06-04-49 6.6.70-xilinx-v2020.1-g77ffe4f31eae #1 SMP Tue Mar  4 09:27:53 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
root@Riedel-NSA-004A-06-04-49:/proc/sys/net/ipv4# cat /etc/os-release 
VERSION="1.0.0"
VERSION_LONG="1.0.0-0.4471b99"
VERSION_DEV="1.0.0-0.4471b99 release 2025-02-14-12:51 @wup-wil0617-fedora:/home/frank/projects/intercom-linux/build/boexli/tmp/work/aarch64-riedel-linux/boexli-repo/git-r0"
YOCTO="3.0.4"
XILINX="v2020.1"
SDK="1.62.0"
root@Riedel-NSA-004A-06-04-49:/proc/sys/net/ipv4# sysctl net.mptcp
net.mptcp.add_addr_timeout = 120
net.mptcp.allow_join_initial_addr_port = 1
net.mptcp.checksum_enabled = 0
net.mptcp.enabled = 1
net.mptcp.pm_type = 0
net.mptcp.scheduler = default
net.mptcp.stale_loss_cnt = 1
root@Riedel-NSA-004A-06-04-49:/proc/sys/net/ipv4# ip mptcp endpoint show
192.168.42.130 id 1 subflow fullmesh dev media2 
10.100.4.130 id 2 subflow fullmesh dev media1 
root@Riedel-NSA-004A-06-04-49:/proc/sys/net/ipv4# ip mptcp limits show
add_addr_accepted 2 subflows 2

System info: Server

root@Riedel-NSA-006A-15-B0-0F:/opt/data# uname -a
Linux Riedel-NSA-006A-15-B0-0F 6.6.70-xilinx-v2020.1-g77ffe4f31eae #1 SMP Tue Mar  4 09:27:53 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
root@Riedel-NSA-006A-15-B0-0F:/opt/data# cat /etc/os-release 
VERSION="1.0.0"
VERSION_LONG="1.0.0-0.4471b99"
VERSION_DEV="1.0.0-0.4471b99 release 2025-02-14-12:51 @wup-wil0617-fedora:/home/frank/projects/intercom-linux/build/boexli/tmp/work/aarch64-riedel-linux/boexli-repo/git-r0"
YOCTO="3.0.4"
XILINX="v2020.1"
SDK="1.62.0"
root@Riedel-NSA-006A-15-B0-0F:/opt/data# sysctl net.mptcp
net.mptcp.add_addr_timeout = 120
net.mptcp.allow_join_initial_addr_port = 1
net.mptcp.checksum_enabled = 0
net.mptcp.enabled = 1
net.mptcp.pm_type = 0
net.mptcp.scheduler = default
net.mptcp.stale_loss_cnt = 4
root@Riedel-NSA-006A-15-B0-0F:/opt/data# ip mptcp endpoint show
10.100.4.129 id 1 signal dev media1 
192.168.42.129 id 2 signal dev media2 
root@Riedel-NSA-006A-15-B0-0F:/opt/data# ip mptcp limits show
add_addr_accepted 0 subflows 1

Additional context

No response

@pabeni

pabeni commented Mar 12, 2025

  • server:
    add_addr_accepted 0 subflows 1
    10.100.4.129 id 1 signal dev media1
    192.168.42.129 id 2 signal dev media2

  • client:
    add_addr_accepted 2 subflows 2
    192.168.42.130 id 1 subflow fullmesh dev media2
    10.100.4.130 id 2 subflow fullmesh dev media1

This configuration is a bit self-inconsistent: subflows 1 on the server limits the total number of additional subflows created per connection to 1, while subflow fullmesh will try to create subflows from the specified endpoint towards all the announced addresses.

If you want a 'fullmesh' topology, you should increase the subflows limit to 3 on both the client and the server. If instead you want each connection to create a single additional subflow on each link, remove the endpoint on the client side.
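
Something along these lines (an untested sketch, not a drop-in configuration):

# option 1: allow a full mesh (3 additional subflows per connection), on both peers:
ip mptcp limits set subflow 3 add_addr_accepted 2

# option 2: a single additional subflow per link: keep the limits and remove the
# fullmesh endpoints on the client (the announced address plus the add_addr_accepted
# limit will create the second subflow):
ip mptcp endpoint flush    # or: ip mptcp endpoint delete id <n>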

When starting the application, I can see via ip mptcp monitor on the server that the subflow is established:
[LISTENER_CREATED] saddr4=0.0.0.0 sport=40675
[ CREATED] token=03d621d1 remid=0 locid=0 saddr4=10.100.4.129 daddr4=10.100.4.130 sport=40675 dport=45436
[ ESTABLISHED] token=03d621d1 remid=0 locid=0 saddr4=10.100.4.129 daddr4=10.100.4.130 sport=40675 dport=45436
[ SF_ESTABLISHED] token=03d621d1 remid=1 locid=2 saddr4=192.168.42.129 daddr4=192.168.42.130 sport=40675 dport=41095 backup=0

Do you always observe this sequence of events? I suspect you should also see SF_ESTABLISHED with different address pairs.

In any case, could you please share a pcap trace for the whole connection with a long handover time?

Thanks

@FrankLorenz
Author

Ok, I wasn't sure if I understood the options correctly, and "fullmesh" was just a left-over from some tests. I removed it again, so now I have the following configuration:

  • server
    10.100.4.129 id 1 signal dev media1
    192.168.42.129 id 2 signal dev media2
    add_addr_accepted 0 subflows 2

  • client
    10.100.4.130 id 2 subflow dev media1
    192.168.42.130 id 3 subflow dev media2
    add_addr_accepted 2 subflows 2

From my understanding, the "fullmesh" option didn't make sense anyway, because the two paths, while on the same switch, are in different subnets without any routing between them. This also explains, IMO, the output I see with ip mptcp monitor - it is not possible to have SF_ESTABLISHED between different address pairs, right?

Anyway, I observe the same behaviour with these corrected settings as well.

I attached a Wireshark trace of a "bad handover" case. It is taken from a mirror port of my switch, mirroring both ports connected to the interfaces of the "server" device. You can see that there is a huge gap between packet No. 2087 and packet No. 4044 without any MPTCP communication.

handover_badCase.zip

Some more information about our system that might be important:

  • We are talking about embedded devices here that run the kernel provided by Xilinx: github.com/Xilinx/linux-xlnx.git - commit 901138a
  • We are currently in the process of upgrading the whole system to a new kernel (6.6) and a new Yocto version. The 6.6 kernel is already running, but the other parts of the system are still old. I am not 100% sure whether e.g. an old libc version could have an impact.
  • We have an internal switch inside the device between the two MACs of the CPU and the external network connectors. The two ports of the switch (towards CPU MAC and towards network connector) are bridged for each of the two network interfaces, so this should IMO be transparent for MPTCP.

@FrankLorenz
Author

Ok, regarding the last bullet point of my previous comment: I also tested the exact same setup between two other embedded devices that do not have this switch and obtained the same results - so the switch does not seem to have any influence.

@FrankLorenz
Author

I now also tested the same setup, but with the "secondary" link endpoints (the 192.168.* addresses) defined as "backup".

  • server
    # ip mptcp endpoint show
    10.100.4.129 id 5 signal dev media1
    192.168.42.129 id 6 signal backup dev media2

  • client
    # ip mptcp endpoint show
    10.100.4.130 id 4 subflow dev media1
    192.168.42.130 id 5 subflow backup dev media2

The behavior is different in this case: I get the "stuck" connection every time when removing the 10.100.4.* link. Additionally, it looks like one or two single packets come through every few seconds before the connection recovers over the backup path after ~10 to 30 seconds.
I attached the Wireshark trace for this. You can see these single packets there (captured packets 534 and 544 / 674 and 681) before the connection becomes performant again 16 seconds later via the backup path (packet 911).
Perhaps this helps to dig into the issue a bit more?

handover_badCase_backup.zip

@FrankLorenz
Author

Ok, I think I have a clue about what is happening.
Up to now, I always tested with the following limits setting:
add_addr_accepted 0 subflows 2 (server)
add_addr_accepted 2 subflows 2 (client)

But because I only want two paths established:
10.100.4.129 <---> 10.100.4.130
and
192.168.42.129 <---> 192.168.42.130

As far as I understand I can go with:
add_addr_accepted 0 subflows 1 (server)
add_addr_accepted 1 subflows 1 (client)

If I reduce subflows from 2 to 1, I cannot observe the issue anymore on my test setup.
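
For reference, that would be set with something like (a sketch):

# server
ip mptcp limits set subflow 1 add_addr_accepted 0
# client
ip mptcp limits set subflow 1 add_addr_accepted 1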

In the Wireshark trace I attached in my previous post, you can see that there are TCP SYN retransmissions for a connection 192.168.42.130 <--> 10.100.4.129. This connection is not possible because there is no routing between these networks, but the retransmissions go on for a long time (at least 40 seconds). Could it be that this unestablished path disturbs the scheduler or path manager in some way?

@matttbe matttbe removed the triage label Mar 18, 2025
@matttbe
Member

matttbe commented Mar 18, 2025

Hi @FrankLorenz,

Thank you for your different replies!

Just to make sure I understand your issue, is the following correct?

  • both the server and the client have two IP addresses (A and B with A being linked to the initial path)
  • with the current routing, only two paths are possible: A-A and B-B, but not A-B and B-A

The behaviour of the (default) in-kernel path-manager is described on our website. In short for your case:

  • On the client side, there is no need to add an endpoint for address B: the in-kernel PM would only use it to establish new subflows towards the original address of the server (A). Note that the server could also set sysctl -w net.mptcp.allow_join_initial_addr_port=0 to force the client not to create additional subflows to A.
  • The server will announce its second address (B) thanks to the signal endpoint you have configured, all good there then.
  • On the client side, a new path will be created when receiving the address announcement from the server if the add_addr_accepted limit is above 0. The PM will pick the source IP depending on the routing configuration. So you might need to add a route on the client side to make sure that when trying to reach B on the server side, it will pick B on the client side. Check with ip route get <server's B IP address>.
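
For example, with the addresses from this report (just a sketch, adjust to your setup):

# on the client: which source address would be used to reach the server's B address?
ip route get 192.168.42.129
# if the wrong source is selected, the preferred source can be pinned on the route, e.g.:
ip route change 192.168.42.0/24 dev media2 src 192.168.42.130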

Some notes:

  • the server announces one IP address at a time. That's OK for you: only one IP to announce.
  • if you reached the limits, the client will need to wait for a subflow to be removed (e.g. timeout to establish the connection, proper disconnection, etc.) to try again. That might explain the delays.
  • I think you can also change the routing configuration on the client side to forbid some paths (ip rule from <IP> table 42 and ip route add unreachable default table 42, or prohibit instead of unreachable); see the sketch after this list.
  • it is recommended to increase the two limits by one to allow reconnections (subflows might not be removed immediately; also, the subflow limit only covers additional subflows, so if you disconnect the initial subflow and reconnect it, it is then counted as an additional subflow, etc.)
  • the PM will try to use the different MPTCP endpoints in the ID order. It should not change anything for you here (only one endpoint should be used to announce IP addresses).
  • if path B-B is removed, the client will need to wait for the server to announce it again.
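
For the routing-based restriction mentioned above, a rough (untested) sketch with the addresses from this report, using an arbitrary table number:

# on the client: traffic sourced from the B address may only stay within the B subnet
ip route add 192.168.42.0/24 dev media2 table 42
ip route add unreachable default table 42    # or 'prohibit' instead of 'unreachable'
ip rule add from 192.168.42.130 table 42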

Does this help to better understand the PM and fix your issue?

@FrankLorenz
Author

FrankLorenz commented Mar 18, 2025

Hi @matttbe , thanks for the detailed reply.

  • both the server and the client have two IP addresses (A and B with A being linked to the initial path)
  • with the current routing, only two paths are possible: A-A and B-B, but not A-B and B-A

Yes, correct. Our goal is to have a redundant connection between the devices over two completely independent networks.

I removed the second endpoint on the client side and this seems to solve the issue. After a connection is established, I now see an "implicit" endpoint:
root@Riedel-RSP-1232HL-08-08-5B:~# ip mptcp endpoint show
10.100.4.130 id 4 subflow dev media1
192.168.42.130 id 5 implicit

When I now disconnect the primary link, the handover happens in less than a second which is fine.

I did read the docs for the Path Manager before doing my tests, but I misinterpreted the statement:

subflow: The endpoint will be used to create an additional subflow using the given source IP address. A client would typically do this.

I interpreted this as "It is necessary to set a 'ip mptcp endpoint [...] subflow' on the secondary interface of the client to enable this interface for MPTCP paths/subflows in general".

The term "subflow" is IMO a little bit ambigous in the documentation at all, because e.g. on the top picture of the Path Manager doc there is the "primary path" also called "initial subflow", while this "initial subflow" does not count for the limits you set with ip mptcp limits

For me, the issue is solved and IMO you can close the ticket, but it would be great if you could answer these two questions, for which I could not find reliable information:

  1. I understand that for MPTCP v1, there is only a "default scheduler" implemented; no redundant scheduler is part of the MPTCP implementation. There is an interface that lets you add your own scheduler from user space. Is this correct? If so, is there any redundant scheduler available yet? I also see open issues regarding this interface - is it stable enough to implement such a "user-land" scheduler?
  2. I observe that, if I first establish the connection (run my client/server application) with my secondary link disconnected and connect this link later, the secondary subflow isn't created, meaning I have no redundancy. Do I need an MPTCP daemon to establish a secondary subflow at runtime, or is there some other way?

matttbe added a commit to multipath-tcp/mptcp.dev that referenced this issue Mar 18, 2025
The 'subflow' endpoint description might be confusing: is it needed to
be specified to create subflows to additional IPs announced by the other
peer? Mentioning that it is only used to create subflows to the other
peer's IP address should help clarify this.

Link: multipath-tcp/mptcp_net-next#554
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
@matttbe matttbe added the pm path-manager label Mar 22, 2025
@matttbe
Member

matttbe commented Mar 22, 2025

  • both the server and the client have two IP addresses (A and B with A being linked to the initial path)
  • with the current routing, only two paths are possible: A-A and B-B, but not A-B and B-A

Yes, correct. Our goal is to have a redundant connection between the devices over two completely independent networks.

I removed the second endpoint on the client side and this seems to solve the issue. After a connection is established, I now see an "implicit" endpoint:
root@Riedel-RSP-1232HL-08-08-5B:~# ip mptcp endpoint show
10.100.4.130 id 4 subflow dev media1
192.168.42.130 id 5 implicit

When I now disconnect the primary link, the handover happens in less than a second which is fine.

Good news! For your case, I guess a feature like #503 (not implemented yet) would be useful: you could easily tell that the client's second address can be used to create additional subflows towards additional addresses announced by the server.

I did read the docs for the Path Manager before doing my tests, but I misinterpreted the statement:

subflow: The endpoint will be used to create an additional subflow using the given source IP address. A client would typically do this.

I interpreted this as "It is necessary to set a 'ip mptcp endpoint [...] subflow' on the secondary interface of the client to enable this interface for MPTCP paths/subflows in general".

OK, indeed, that's not very clear. This behaviour is explained in more detail a bit below. We can always improve the doc. Do you think the following modification would help? multipath-tcp/mptcp.dev#51

The term "subflow" is IMO a little bit ambigous in the documentation at all, because e.g. on the top picture of the Path Manager doc there is the "primary path" also called "initial subflow", while this "initial subflow" does not count for the limits you set with ip mptcp limits

OK. Do you think the other modification from multipath-tcp/mptcp.dev#51 can help?

For me, the issue is solved and IMO you can close the ticket, but it would be great if you could answer these two questions, for which I could not find reliable information:

1. I understand that for MPTCP v1, there is only a "default scheduler" implemented; no redundant scheduler is part of the MPTCP implementation. There is an interface that lets you add your own scheduler from user space. Is this correct? If so, is there any redundant scheduler available yet? I also see open issues regarding this interface - is it stable enough to implement such a "user-land" scheduler?

Only one scheduler for the moment. We are working on allowing the creation of additional schedulers in BPF (#75), but this is not ready yet.

Note that it is technically possible to have a redundant packet scheduler like we had in the old fork, but both implementations suffer from some protocol limitations that might make this feature less interesting, see here. But I can understand that MPTCP might be easy to deploy, and the limitation might not be a problem in some cases.

2. I observe that, if I first establish the connection (run my client/server application) with my secondary link disconnected and connect this link later, the secondary subflow isn't created, meaning I have no redundancy. Do I need an MPTCP daemon to establish a secondary subflow at runtime, or is there some other way?

Mmh, currently the in-kernel PM doesn't explicitly remember the addresses announced by the other peer. If, when an address is announced, the peer cannot do anything with it, it will not be used later on. I guess that's what you have here, right?

A userspace daemon could support that. Or the in-kernel PM could be modified to support that (linked to #496).

If all the questions have been answered, please close this ticket.

@FrankLorenz
Author

Hi Matt,
I think your additions to the docs make it a little clearer. I still think it is not straightforward to understand what has to be done to achieve what we need, but this might be caused by the fact that our use case is not the "normal" use case most people have in mind when looking into MPTCP.

My questions are answered and therefore I close this ticket.

@sskras

sskras commented Mar 26, 2025

@FrankLorenz commented 44 minutes ago:

this might be caused by the fact that our use-case is not the "normal" use case most people have in mind when looking into MPTCP.

This is also similar to an issue I had a year ago (in 2024), when I tried doing some minimal MPTCP research for my thesis.

IIUC, in enterprise DCNs it is perfectly fine and expected for most subnets to have no routes between them, or to be firewalled from each other (especially with the Zero Trust security model).

So the scenario seems like a very fair request to me. You have my support here. :)

@sskras

sskras commented Mar 26, 2025

PS, @FrankLorenz: you mentioned rmptcp, the Redundant scheduler for MPTCP. Does that repository target the v0 or the v1 implementation, or maybe both?
https://www.mptcp.dev/faq.html#mptcpv0-vs-mptcpv1
