
Conversation

@hoytak (Collaborator) commented Jul 10, 2025

This PR implements a robust controller for adaptively adjusting the concurrency of upload and download transfers.

This controller uses two statistical models that adapt over time using exponentially weighted moving averages. The first predicts the overall current bandwidth, and the second models the deviance between the actual transfer time and the time predicted by a linear scaling of the concurrency.
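
For concreteness, here is a rough sketch of how these two pieces fit together; the names, the smoothing factor, and the even-split prediction formula below are illustrative shorthand, not the actual implementation:

```rust
// Both models are EWMAs: one tracks the overall current bandwidth, the
// other tracks the deviance between actual and predicted transfer times.
fn ewma(prev: f64, observation: f64, alpha: f64) -> f64 {
    alpha * observation + (1.0 - alpha) * prev
}

// Predicted time for one transfer under linear scaling: N concurrent
// transfers each get roughly 1/N of the total bandwidth, so a single
// transfer's latency scales linearly with the concurrency N.
fn predicted_transfer_secs(size_bytes: u64, bandwidth_bytes_per_sec: f64, concurrency: usize) -> f64 {
    size_bytes as f64 * concurrency as f64 / bandwidth_bytes_per_sec
}
```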

The key idea is this:

  1. When a network connection is underutilized, the latency scales sublinearly with the number of parallel connections. In other words, adding another transfer does not affect the speed of the other transfers significantly.
  2. When a network connection is fully utilized, the latency scales linearly with the concurrency. In other words, increasing the concurrency from N to N+1 causes the latency of all the other transfers to increase by a factor of (N+1) / N.
  3. When a network connection is oversaturated, then the latency scales superlinearly. In other words, adding an additional connection causes the overall throughput to decrease.

Now, because latency is a noisy observation, we track a running clipped average of the deviance between the predicted and actual times, increasing the concurrency when this is reliably sublinear and decreasing it when it is superlinear. The model uses clipped observations to avoid having any single observation be weighted too heavily; failures and retries max out the deviance.
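
A rough sketch of the resulting decision rule; all the constants here (clip bounds, thresholds, smoothing factor) are placeholders, not the tuned values:

```rust
// Each completed transfer reports its actual time alongside the time
// predicted by linear scaling of the concurrency. The ratio is clipped so
// that no single observation dominates, then folded into a running EWMA.
struct ConcurrencyController {
    concurrency: usize,
    max_concurrency: usize,
    deviance: f64, // EWMA of clipped actual/predicted time ratios
    alpha: f64,    // EWMA smoothing factor in (0, 1]
}

impl ConcurrencyController {
    fn report_transfer(&mut self, predicted_secs: f64, actual_secs: f64) {
        // Failures and retries report actual_secs = f64::INFINITY, so the
        // clamp maxes out their contribution to the deviance.
        let ratio = (actual_secs / predicted_secs).clamp(0.25, 4.0);
        self.deviance = self.alpha * ratio + (1.0 - self.alpha) * self.deviance;

        if self.deviance < 0.9 && self.concurrency < self.max_concurrency {
            self.concurrency += 1; // reliably sublinear: link underutilized
        } else if self.deviance > 1.5 && self.concurrency > 1 {
            self.concurrency -= 1; // reliably superlinear: link oversaturated
        }
    }
}
```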

@hoytak force-pushed the hoytak/250701-rate-adjustor branch from daedbb3 to b7423e4 on July 10, 2025 23:04
@hoytak requested a review from bpronan on July 11, 2025 00:52
@hoytak changed the title from "Automatic concurrency adjustment for parallel uploads and downloads" to "Automatic concurrency adjustment for transfers" on Jul 11, 2025
@seanses (Collaborator) left a comment


I understand the controller logic, but experiments are needed before this becomes a convincing solution. Places where I have doubts:

  • A success signal is defined as a 200 status code and the transfer finishing within a linearly interpolated, hardcoded time limit. This yields different speed expectations for xorbs of different sizes, e.g. 6.9 Mbps for a 10 MB xorb and 25.6 Mbps for a 64 MB xorb. In practice they should see the same speed.

  • After 90% success signals within an observation window, the controller attempts to increment the concurrency by 1. It's unclear what target this controller is maximizing. Since 25.6 Mbps can easily be reached, this control logic means the concurrency will shoot to the max value (100). But the overall throughput with 100 concurrent flows is not guaranteed to be greater than with, say, 20 flows, and it's unknown whether, with 100 concurrent flows, a single flow's speed will drop below 25.6 Mbps to trigger a decrease signal. This implies that users need to manually tune the speed-expectation argument to their network conditions, which provides no better experience than tuning the concurrency directly.
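
To make this concern concrete, here is the increment rule as I read it (names and window handling are illustrative, not the PR's code):

```rust
// Per my reading of the PR: count success signals over an observation
// window; at >= 90% success, bump the concurrency by one. With a speed
// expectation that is easy to meet, nothing pushes back until the cap.
fn on_window_end(successes: u32, total: u32, concurrency: &mut u32, cap: u32) {
    if total > 0 && successes * 10 >= total * 9 && *concurrency < cap {
        *concurrency += 1; // ratchets steadily toward the cap (e.g. 100)
    }
}
```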

@hoytak (Collaborator, Author) commented Jul 15, 2025

I understand the controller logic, but experiments are needed before this becomes a convincing solution. Places where I have doubts:

  • A success signal is defined as a 200 status code and the transfer finishing within a linearly interpolated, hardcoded time limit. This yields different speed expectations for xorbs of different sizes, e.g. 6.9 Mbps for a 10 MB xorb and 25.6 Mbps for a 64 MB xorb. In practice they should see the same speed.

I definitely agree it needs more testing, and I'll try to set some things up to do that. On the speed calculations: there is also a constant time overhead due to server processing, which I was trying to account for in those numbers. Granted, though, they are empirically calculated.
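
For what it's worth, back-solving the limit from your two examples (assuming it is linear in xorb size) gives roughly a 10 s constant plus ~0.16 s per MB, which is where that constant overhead shows up:

```rust
// Back-solve the assumed linear time limit, limit(size_mb) = a + b * size_mb,
// from the two examples above; two points give two equations.
fn main() {
    // 10 MB at 6.9 Mbps  => 80 Mb / 6.9 Mbps  ≈ 11.6 s allowed
    // 64 MB at 25.6 Mbps => 512 Mb / 25.6 Mbps = 20.0 s allowed
    let (s1, t1) = (10.0_f64, 80.0 / 6.9);
    let (s2, t2) = (64.0_f64, 512.0 / 25.6);
    let b = (t2 - t1) / (s2 - s1); // ≈ 0.156 s/MB, i.e. ≈ 51 Mbps marginal rate
    let a = t1 - b * s1;           // ≈ 10.0 s constant (server-side) overhead
    println!("limit(size_mb) ≈ {a:.1} s + {b:.3} s/MB * size_mb");
}
```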

  • After 90% success signals within an observation window, the controller attempts to increment the concurrency by 1. It's unclear what target this controller is maximizing. Since 25.6 Mbps can easily be reached, this control logic means the concurrency will shoot to the max value (100). But the overall throughput with 100 concurrent flows is not guaranteed to be greater than with, say, 20 flows, and it's unknown whether, with 100 concurrent flows, a single flow's speed will drop below 25.6 Mbps to trigger a decrease signal. This implies that users need to manually tune the speed-expectation argument to their network conditions, which provides no better experience than tuning the concurrency directly.

It's not really accurate to think of this in terms of speed, but rather in terms of whether or not the bandwidth is saturated, with the goal of backing off before it's overloaded. There's a wide range of concurrency values over which it remains saturated. In this window, the time per transfer increases due to congestion, which is fine, until it becomes too congested and things start getting dropped or failing.

The idea is to increase the concurrency to the point where we're confident the connection is saturated, but before problems arise. Problems are detected early by the latency of a connection increasing past a threshold. That said, the value of this threshold is rather arbitrary and tuned to my connection, so I think I'm going to change that part to be a bit more automatic.

@hoytak force-pushed the hoytak/250701-rate-adjustor branch from 4719995 to 30372a0 on October 23, 2025 15:05
@hoytak force-pushed the hoytak/250701-rate-adjustor branch from b1cbe39 to f990dc7 on November 19, 2025 19:04