
[rust] cache-aware DP - approx tree #1934

Open · wants to merge 5 commits into main from byhsu/approx-tree
Conversation

ByronHsu (Collaborator) commented Nov 6, 2024

1. Algorithm Overview

#1732

We propose a method to approximate worker-side radix trees on the router side using request and response information flow. The router maintains approximated trees ("approx trees") that mirror the cache state of each worker's radix tree ("worker tree").

Core Algorithm

Given N workers, the algorithm operates as follows:

  1. Initialize N approx trees on the router, one corresponding to each worker
  2. For each incoming request:
    • Match the request's input_ids against all approx trees
    • Select the worker with the highest cache match
    • Speculation Phase: Before forwarding the request, insert input_ids into the selected approx tree
    • Correction Phase: Upon receiving the worker's response, remove non-cached tokens from the approx tree

The speculation phase anticipates the worker tree's future state, while the correction phase aligns the approx tree with the worker's actual cache state. This forward-backward mechanism enables continuous self-adjustment of the approximation.
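The forward-backward loop above can be sketched as follows. This is a hypothetical simplification, not the PR's actual API: a flat list of cached prefixes stands in for the radix tree, and names like `prefix_match`, `truncate_last_to`, and `select_worker` are invented for illustration.

```rust
// Minimal sketch of the speculation/correction loop, with a flat
// prefix store standing in for the per-worker radix tree.

#[derive(Default)]
struct ApproxTree {
    // Flat stand-in for the radix tree: one cached prefix per entry.
    prefixes: Vec<Vec<u32>>,
}

impl ApproxTree {
    // Length of the longest cached prefix matching `input_ids`.
    fn prefix_match(&self, input_ids: &[u32]) -> usize {
        self.prefixes
            .iter()
            .map(|p| p.iter().zip(input_ids).take_while(|(a, b)| a == b).count())
            .max()
            .unwrap_or(0)
    }

    // Speculation: assume the worker will cache the full request.
    fn insert(&mut self, input_ids: &[u32]) {
        self.prefixes.push(input_ids.to_vec());
    }

    // Correction: keep only the tokens the worker actually cached.
    fn truncate_last_to(&mut self, cached_len: usize) {
        if let Some(last) = self.prefixes.last_mut() {
            last.truncate(cached_len);
        }
    }
}

// Route each request to the worker whose approx tree matches best.
fn select_worker(trees: &[ApproxTree], input_ids: &[u32]) -> usize {
    (0..trees.len())
        .max_by_key(|&i| trees[i].prefix_match(input_ids))
        .unwrap()
}

fn main() {
    let mut trees = vec![ApproxTree::default(), ApproxTree::default()];
    trees[0].insert(&[1, 2, 3, 4]);

    let request = [1, 2, 3, 9];
    let chosen = select_worker(&trees, &request);
    assert_eq!(chosen, 0); // worker 0 shares the [1, 2, 3] prefix

    trees[chosen].insert(&request); // speculation phase
    trees[chosen].truncate_last_to(3); // correction: worker cached only 3 tokens
    assert_eq!(trees[chosen].prefix_match(&request), 3);
    println!("selected worker {}", chosen);
}
```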

Load Balancing Strategies

1. Cache Threshold with Shortest Queue

  1. Maintain N approx trees on the router
  2. For each request:
    • Calculate cache match rates across all workers
    • If highest match rate > threshold: Select highest-matching worker
    • Else: Select worker with shortest request queue
    • Apply speculation and correction phases
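The selection rule above can be sketched in a few lines; `pick_worker`, `match_rates`, `queue_lens`, and `threshold` are illustrative names, not the PR's actual fields.

```rust
// Strategy 1 sketch: prefer the best cache match if it clears the
// threshold, otherwise fall back to the shortest queue.
fn pick_worker(match_rates: &[f32], queue_lens: &[usize], threshold: f32) -> usize {
    let best = (0..match_rates.len())
        .max_by(|&a, &b| match_rates[a].partial_cmp(&match_rates[b]).unwrap())
        .unwrap();
    if match_rates[best] > threshold {
        best // cache hit is worth routing to this worker
    } else {
        // no strong match: balance load instead
        (0..queue_lens.len()).min_by_key(|&i| queue_lens[i]).unwrap()
    }
}

fn main() {
    // Worker 2's match rate (0.9) clears the 0.5 threshold.
    assert_eq!(pick_worker(&[0.1, 0.2, 0.9], &[5, 1, 9], 0.5), 2);
    // No worker clears the threshold: choose the shortest queue (worker 1).
    assert_eq!(pick_worker(&[0.1, 0.2, 0.3], &[5, 1, 9], 0.5), 1);
    println!("ok");
}
```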

2. Variance-Based Load Balancing

  1. Maintain N approx trees on the router
  2. For each request:
    • Calculate mean (μ) and standard deviation (σ) of queue lengths
    • Filter out workers with queue length > μ + Kσ
    • Among remaining workers, select highest cache match
    • Apply speculation and correction phases

Parameters

  • K: variance threshold parameter
  • μ: mean queue length
  • σ: standard deviation of queue lengths
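With these parameters, the outlier filter might look like the following sketch; `pick_worker_variance` and its arguments are hypothetical names, not the PR's implementation.

```rust
// Strategy 2 sketch: drop workers whose queue length exceeds μ + Kσ,
// then pick the best cache match among the survivors.
fn pick_worker_variance(match_rates: &[f32], queue_lens: &[usize], k: f32) -> usize {
    let n = queue_lens.len() as f32;
    let mean = queue_lens.iter().sum::<usize>() as f32 / n; // μ
    let var = queue_lens
        .iter()
        .map(|&q| (q as f32 - mean).powi(2))
        .sum::<f32>()
        / n;
    let std = var.sqrt(); // σ
    (0..queue_lens.len())
        .filter(|&i| (queue_lens[i] as f32) <= mean + k * std)
        .max_by(|&a, &b| match_rates[a].partial_cmp(&match_rates[b]).unwrap())
        .unwrap()
}

fn main() {
    // Worker 2 has the best match but an outlier queue (20 vs μ ≈ 7.7),
    // so at K = 1 it is filtered out and worker 1 wins.
    assert_eq!(pick_worker_variance(&[0.2, 0.5, 0.9], &[1, 2, 20], 1.0), 1);
    // A looser K = 2 keeps worker 2 in the candidate set.
    assert_eq!(pick_worker_variance(&[0.2, 0.5, 0.9], &[1, 2, 20], 2.0), 2);
    println!("ok");
}
```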

See the Google Doc version for details.

2. Changes

  1. Implemented the core algorithm and the "Cache Threshold with Shortest Queue" LB strategy in Rust
  2. Rust Python binding for the approx tree
  3. Rust CLI for the approx tree
  4. A benchmark file which simulates the case of a long system prompt + relatively shorter QA
  5. Demo Python binding usage under py_src
  • main.py: launches a minimal router server with an existing worker
  • dp_demo.py: mimics the current --dp setup by launching --dp workers plus a router. This code can later be moved into sglang core so that sglang depends on sglang-router, with sglang-router installable as an optional dependency of sglang.

Benchmark

# original
python -m sglang.launch_server --model-path <model path> --host 127.0.0.1 --port 30000 --dp 8

# approx tree
python dp_demo.py --model-path <model path> --local-tokenizer-path <local tokenizer.json path> --dp 8

# benchmark client
python long_prompt_multi_turn.py --port 30000  --tokenizer <tokenizer path>

We can see a clear improvement over the original method due to the high cache hit rate.

[benchmark result images]

Reference: Google Sheet

Follow-up

There are still many follow-ups, notably:

  1. Python binding CI publishing setup (sglang-router PyPI)
  2. Various optimizations in Rust (marked as TODO in the code)
  3. Implementation of the correction phase and Variance-Based Load Balancing

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu force-pushed the byhsu/approx-tree branch 2 times, most recently from a9203e3 to 7233496 on November 7, 2024

Commits: "faster than --dp", "wip", "wip", "wip"

@ByronHsu changed the title from "[WIP] Byhsu/approx tree" to "[rust] Approximate Tree Initial Version" on Nov 8, 2024
@ByronHsu changed the title from "[rust] Approximate Tree Initial Version" to "[rust] cache-aware DP - approx tree" on Nov 8, 2024
ApproxTree {
    worker_urls: Vec<String>,
    // TODO: don't lock the whole tree
    url_to_tree: Arc<Mutex<HashMap<String, RadixTree>>>,


DashMap may help here.

ByronHsu (Collaborator, Author) replied:


Actually, I want to lock on the nodes of the radix tree instead of the whole hashmap. Do you have any suggestions for that?


I think we can start from the simple one (lock the whole map). A concurrent map is quite complex. We may try out this comment later on if you are interested:
https://users.rust-lang.org/t/locking-only-one-entry-in-hashmap/84764/4
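One way to sketch the per-entry locking idea from that thread: an outer `RwLock` guards the map's shape while each tree carries its own `Mutex`, so updates to different workers' trees don't serialize on a single lock. `RadixTree` is stubbed out and `update_tree` is an invented helper, not the PR's code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Stub standing in for the real radix tree.
#[derive(Default)]
struct RadixTree {
    size: usize,
}

// RwLock protects map membership; each tree has its own Mutex.
type TreeMap = RwLock<HashMap<String, Arc<Mutex<RadixTree>>>>;

fn update_tree(map: &TreeMap, url: &str) {
    // Clone the Arc under a short read lock, then release the map
    // before touching the tree, so other workers' trees stay free.
    let tree = map.read().unwrap().get(url).cloned();
    if let Some(tree) = tree {
        // Only this worker's tree is locked during the update.
        tree.lock().unwrap().size += 1;
    }
}

fn main() {
    let map: TreeMap = RwLock::new(HashMap::new());
    map.write().unwrap().insert("http://w0".into(), Arc::default());
    update_tree(&map, "http://w0");
    assert_eq!(map.read().unwrap()["http://w0"].lock().unwrap().size, 1);
    println!("ok");
}
```

A crate like DashMap packages a similar shard-level locking pattern behind a plain map API.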

},
Random {
    worker_urls: Vec<String>,
},
ApproxTree {
    worker_urls: Vec<String>,

@jayzhan211 commented Nov 8, 2024:


If worker_urls is mostly read-only, instead of owned Strings you can use Arc<str> for thread-safe sharing. Vec<Arc<str>> is probably suitable in this case.

@ByronHsu (Collaborator, Author) replied Nov 8, 2024:


For now the Vec is read-only, but once we add dynamic scaling it will be read+write, though mostly read. Vec<Arc<str>> looks great.
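For illustration, a tiny sketch of why `Vec<Arc<str>>` is cheap to share: cloning an `Arc<str>` bumps a reference count instead of copying the string data, which suits a mostly-read worker list shared across threads.

```rust
use std::sync::Arc;

fn main() {
    let urls: Vec<Arc<str>> = vec![Arc::from("http://w0"), Arc::from("http://w1")];
    // Refcount bump, no allocation or byte copy.
    let cheap_copy = urls[0].clone();
    assert_eq!(&*cheap_copy, "http://w0");
    assert_eq!(Arc::strong_count(&urls[0]), 2);
    println!("ok");
}
```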

4 participants