
Conversation

@hoytak (Collaborator) commented Jul 10, 2025

This PR implements a robust controller for adaptively adjusting the concurrency of upload and download transfers.

This controller uses two statistical models that adapt over time using exponentially weighted moving averages. The first predicts the overall current bandwidth, and the second models the deviance between the actual transfer time and the time predicted by a linear scaling of the concurrency.
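
For concreteness, here is a rough sketch of how these two pieces fit together; the names, the smoothing factor, and the even-split prediction formula below are illustrative shorthand, not the actual implementation:

```rust
// Both models are EWMAs: one tracks the overall current bandwidth, the
// other tracks the deviance between actual and predicted transfer times.
fn ewma(prev: f64, observation: f64, alpha: f64) -> f64 {
    alpha * observation + (1.0 - alpha) * prev
}

// Predicted time for one transfer under linear scaling: N concurrent
// transfers each get roughly 1/N of the total bandwidth, so a single
// transfer's latency scales linearly with the concurrency N.
fn predicted_transfer_secs(size_bytes: u64, bandwidth_bytes_per_sec: f64, concurrency: usize) -> f64 {
    size_bytes as f64 * concurrency as f64 / bandwidth_bytes_per_sec
}
```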

The key idea is this:

  1. When a network connection is underutilized, the latency scales sublinearly with the number of parallel connections. In other words, adding another transfer does not affect the speed of the other transfers significantly.
  2. When a network connection is fully utilized, the latency scales linearly with the concurrency. In other words, increasing the concurrency from N to N+1 causes the latency of all the other transfers to increase by a factor of (N+1) / N.
  3. When a network connection is oversaturated, then the latency scales superlinearly. In other words, adding an additional connection causes the overall throughput to decrease.

Now, because latency is a noisy observation, we track a running clipped average of the deviance between the predicted and actual times, increasing the concurrency when this is reliably sublinear and decreasing it when it is superlinear. The model uses clipped observations to avoid having any single observation be weighted too heavily; failures and retries max out the deviance.
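
A rough sketch of the resulting decision rule; all the constants here (clip bounds, thresholds, smoothing factor) are placeholders, not the tuned values:

```rust
// Each completed transfer reports its actual time alongside the time
// predicted by linear scaling of the concurrency. The ratio is clipped so
// that no single observation dominates, then folded into a running EWMA.
struct ConcurrencyController {
    concurrency: usize,
    max_concurrency: usize,
    deviance: f64, // EWMA of clipped actual/predicted time ratios
    alpha: f64,    // EWMA smoothing factor in (0, 1]
}

impl ConcurrencyController {
    fn report_transfer(&mut self, predicted_secs: f64, actual_secs: f64) {
        // Failures and retries report actual_secs = f64::INFINITY, so the
        // clamp maxes out their contribution to the deviance.
        let ratio = (actual_secs / predicted_secs).clamp(0.25, 4.0);
        self.deviance = self.alpha * ratio + (1.0 - self.alpha) * self.deviance;

        if self.deviance < 0.9 && self.concurrency < self.max_concurrency {
            self.concurrency += 1; // reliably sublinear: link underutilized
        } else if self.deviance > 1.5 && self.concurrency > 1 {
            self.concurrency -= 1; // reliably superlinear: link oversaturated
        }
    }
}
```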

@hoytak force-pushed the hoytak/250701-rate-adjustor branch from daedbb3 to b7423e4 on July 10, 2025 23:04
@hoytak requested a review from bpronan on July 11, 2025 00:52
@hoytak changed the title from "Automatic concurrency adjustment for parallel uploads and downloads" to "Automatic concurrency adjustment for transfers" on Jul 11, 2025
@seanses (Collaborator) left a comment


I understand the controller logic, but experiments are needed before this becomes a convincing solution. Places where I have doubts:

  • A success signal is defined as a 200 status code and the transfer finishing within a linearly interpolated, hardcoded time limit. This yields different speed expectations for xorbs of different sizes, e.g. 6.9 Mbps for a 10 MB xorb and 25.6 Mbps for a 64 MB xorb. In practice they should see the same speed.

  • After 90% success signals within an observation window, the controller attempts to increment the concurrency by 1. It's unclear what target this controller is maximizing. Since 25.6 Mbps can easily be reached, this control logic means the concurrency will shoot to the max value (100). But the overall throughput with 100 concurrent flows is not guaranteed to be greater than with, say, 20 flows, and it's unknown whether, with 100 concurrent flows, a single flow's speed will drop below 25.6 Mbps to trigger a decrease signal. This implies that users need to manually tune the speed-expectation argument to their network conditions, which provides no better experience than tuning the concurrency directly.
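
To make this concern concrete, here is the increment rule as I read it (names and window handling are illustrative, not the PR's code):

```rust
// Per my reading of the PR: count success signals over an observation
// window; at >= 90% success, bump the concurrency by one. With a speed
// expectation that is easy to meet, nothing pushes back until the cap.
fn on_window_end(successes: u32, total: u32, concurrency: &mut u32, cap: u32) {
    if total > 0 && successes * 10 >= total * 9 && *concurrency < cap {
        *concurrency += 1; // ratchets steadily toward the cap (e.g. 100)
    }
}
```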

@hoytak (Collaborator, Author) commented Jul 15, 2025

I understand the controller logic, but experiments are needed before this becomes a convincing solution. Places where I have doubts:

  • A success signal is defined as a 200 status code and the transfer finishing within a linearly interpolated, hardcoded time limit. This yields different speed expectations for xorbs of different sizes, e.g. 6.9 Mbps for a 10 MB xorb and 25.6 Mbps for a 64 MB xorb. In practice they should see the same speed.

I definitely agree it needs more testing, and I'll try to set some things up to do that. On the speed calculations: there is also a constant time overhead due to server processing, which I was trying to account for in those numbers. Granted, though, they are empirically calculated.
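
For what it's worth, back-solving the limit from your two examples (assuming it is linear in xorb size) gives roughly a 10 s constant plus ~0.16 s per MB, which is where that constant overhead shows up:

```rust
// Back-solve the assumed linear time limit, limit(size_mb) = a + b * size_mb,
// from the two examples above; two points give two equations.
fn main() {
    // 10 MB at 6.9 Mbps  => 80 Mb / 6.9 Mbps  ≈ 11.6 s allowed
    // 64 MB at 25.6 Mbps => 512 Mb / 25.6 Mbps = 20.0 s allowed
    let (s1, t1) = (10.0_f64, 80.0 / 6.9);
    let (s2, t2) = (64.0_f64, 512.0 / 25.6);
    let b = (t2 - t1) / (s2 - s1); // ≈ 0.156 s/MB, i.e. ≈ 51 Mbps marginal rate
    let a = t1 - b * s1;           // ≈ 10.0 s constant (server-side) overhead
    println!("limit(size_mb) ≈ {a:.1} s + {b:.3} s/MB * size_mb");
}
```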

  • After 90% success signals within an observation window, the controller attempts to increment the concurrency by 1. It's unclear what target this controller is maximizing. Since 25.6 Mbps can easily be reached, this control logic means the concurrency will shoot to the max value (100). But the overall throughput with 100 concurrent flows is not guaranteed to be greater than with, say, 20 flows, and it's unknown whether, with 100 concurrent flows, a single flow's speed will drop below 25.6 Mbps to trigger a decrease signal. This implies that users need to manually tune the speed-expectation argument to their network conditions, which provides no better experience than tuning the concurrency directly.

It's not really accurate to think of this in terms of speed, but rather in terms of whether or not the bandwidth is saturated, with the goal of backing off before it's overloaded. There's a wide range of concurrency values over which it remains saturated. In this window, the time per transfer increases due to congestion, which is fine, until it becomes too congested and things start getting dropped or failing.

The idea is to increase the concurrency to the point where we're confident the connection is saturated, but before problems arise. Problems are detected early by the latency of a connection increasing past a threshold. That said, the value of this threshold is rather arbitrary and tuned to my connection, so I think I'm going to change that part to be a bit more automatic.

@hoytak force-pushed the hoytak/250701-rate-adjustor branch from 4719995 to 30372a0 on October 23, 2025 15:05
@hoytak force-pushed the hoytak/250701-rate-adjustor branch from b1cbe39 to f990dc7 on November 19, 2025 19:04