
[rust] cache-aware DP - approx tree #1934

Open · wants to merge 5 commits into main from byhsu/approx-tree
Conversation

ByronHsu (Collaborator) commented Nov 6, 2024

1. Algorithm Overview

#1732

We propose a method to approximate worker-side radix trees on the router side using request and response information flow. The router maintains approximated trees ("approx trees") that mirror the cache state of each worker's radix tree ("worker tree").

Core Algorithm

Given N workers, the algorithm operates as follows:

  1. Initialize N approx trees on the router, one corresponding to each worker
  2. For each incoming request:
    • Match the request's input_ids against all approx trees
    • Select the worker with the highest cache match
    • Speculation Phase: Before forwarding the request, insert input_ids into the selected approx tree
    • Correction Phase: Upon receiving the worker's response, remove non-cached tokens from the approx tree

The speculation phase anticipates the worker tree's future state, while the correction phase aligns the approx tree with the worker's actual cache state. This forward-backward mechanism enables continuous self-adjustment of the approximation.
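The forward-backward loop above can be sketched as follows. This is a hypothetical simplification, not the PR's actual API: a flat list of cached prefixes stands in for the radix tree, and names like `prefix_match`, `truncate_last_to`, and `select_worker` are invented for illustration.

```rust
// Minimal sketch of the speculation/correction loop, with a flat
// prefix store standing in for the per-worker radix tree.

#[derive(Default)]
struct ApproxTree {
    // Flat stand-in for the radix tree: one cached prefix per entry.
    prefixes: Vec<Vec<u32>>,
}

impl ApproxTree {
    // Length of the longest cached prefix matching `input_ids`.
    fn prefix_match(&self, input_ids: &[u32]) -> usize {
        self.prefixes
            .iter()
            .map(|p| p.iter().zip(input_ids).take_while(|(a, b)| a == b).count())
            .max()
            .unwrap_or(0)
    }

    // Speculation: assume the worker will cache the full request.
    fn insert(&mut self, input_ids: &[u32]) {
        self.prefixes.push(input_ids.to_vec());
    }

    // Correction: keep only the tokens the worker actually cached.
    fn truncate_last_to(&mut self, cached_len: usize) {
        if let Some(last) = self.prefixes.last_mut() {
            last.truncate(cached_len);
        }
    }
}

// Route each request to the worker whose approx tree matches best.
fn select_worker(trees: &[ApproxTree], input_ids: &[u32]) -> usize {
    (0..trees.len())
        .max_by_key(|&i| trees[i].prefix_match(input_ids))
        .unwrap()
}

fn main() {
    let mut trees = vec![ApproxTree::default(), ApproxTree::default()];
    trees[0].insert(&[1, 2, 3, 4]);

    let request = [1, 2, 3, 9];
    let chosen = select_worker(&trees, &request);
    assert_eq!(chosen, 0); // worker 0 shares the [1, 2, 3] prefix

    trees[chosen].insert(&request); // speculation phase
    trees[chosen].truncate_last_to(3); // correction: worker cached only 3 tokens
    assert_eq!(trees[chosen].prefix_match(&request), 3);
    println!("selected worker {}", chosen);
}
```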

Load Balancing Strategies

1. Cache Threshold with Shortest Queue

  1. Maintain N approx trees on the router
  2. For each request:
    • Calculate cache match rates across all workers
    • If highest match rate > threshold: Select highest-matching worker
    • Else: Select worker with shortest request queue
    • Apply speculation and correction phases
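The selection rule above can be sketched in a few lines; `pick_worker`, `match_rates`, `queue_lens`, and `threshold` are illustrative names, not the PR's actual fields.

```rust
// Strategy 1 sketch: prefer the best cache match if it clears the
// threshold, otherwise fall back to the shortest queue.
fn pick_worker(match_rates: &[f32], queue_lens: &[usize], threshold: f32) -> usize {
    let best = (0..match_rates.len())
        .max_by(|&a, &b| match_rates[a].partial_cmp(&match_rates[b]).unwrap())
        .unwrap();
    if match_rates[best] > threshold {
        best // cache hit is worth routing to this worker
    } else {
        // no strong match: balance load instead
        (0..queue_lens.len()).min_by_key(|&i| queue_lens[i]).unwrap()
    }
}

fn main() {
    // Worker 2's match rate (0.9) clears the 0.5 threshold.
    assert_eq!(pick_worker(&[0.1, 0.2, 0.9], &[5, 1, 9], 0.5), 2);
    // No worker clears the threshold: choose the shortest queue (worker 1).
    assert_eq!(pick_worker(&[0.1, 0.2, 0.3], &[5, 1, 9], 0.5), 1);
    println!("ok");
}
```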

2. Variance-Based Load Balancing

  1. Maintain N approx trees on the router
  2. For each request:
    • Calculate mean (μ) and standard deviation (σ) of queue lengths
    • Filter out workers with queue length > μ + Kσ
    • Among remaining workers, select highest cache match
    • Apply speculation and correction phases

Parameters

  • K: variance threshold parameter
  • μ: mean queue length
  • σ: standard deviation of queue lengths
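With these parameters, the outlier filter might look like the following sketch; `pick_worker_variance` and its arguments are hypothetical names, not the PR's implementation.

```rust
// Strategy 2 sketch: drop workers whose queue length exceeds μ + Kσ,
// then pick the best cache match among the survivors.
fn pick_worker_variance(match_rates: &[f32], queue_lens: &[usize], k: f32) -> usize {
    let n = queue_lens.len() as f32;
    let mean = queue_lens.iter().sum::<usize>() as f32 / n; // μ
    let var = queue_lens
        .iter()
        .map(|&q| (q as f32 - mean).powi(2))
        .sum::<f32>()
        / n;
    let std = var.sqrt(); // σ
    (0..queue_lens.len())
        .filter(|&i| (queue_lens[i] as f32) <= mean + k * std)
        .max_by(|&a, &b| match_rates[a].partial_cmp(&match_rates[b]).unwrap())
        .unwrap()
}

fn main() {
    // Worker 2 has the best match but an outlier queue (20 vs μ ≈ 7.7),
    // so at K = 1 it is filtered out and worker 1 wins.
    assert_eq!(pick_worker_variance(&[0.2, 0.5, 0.9], &[1, 2, 20], 1.0), 1);
    // A looser K = 2 keeps worker 2 in the candidate set.
    assert_eq!(pick_worker_variance(&[0.2, 0.5, 0.9], &[1, 2, 20], 2.0), 2);
    println!("ok");
}
```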

See the Google Doc version for details.

2. Changes

  1. Implemented the core algorithm and the "Cache Threshold with Shortest Queue" LB strategy in Rust
  2. Rust Python binding for the approx tree
  3. Rust CLI for the approx tree
  4. A benchmark file which simulates the case of a long system prompt + relatively shorter QA
  5. Demo Python binding usage under py_src
  • main.py: launches a minimal router server with an existing worker
  • dp_demo.py: mimics the current --dp setup by launching --dp workers plus a router. This code can later be moved into sglang core so that sglang depends on sglang-router, with sglang-router installable as an optional dependency of sglang.

Benchmark

# original
python -m sglang.launch_server --model-path <model path> --host 127.0.0.1 --port 30000 --dp 8

# approx tree
python dp_demo.py --model-path <model path> --local-tokenizer-path <local tokenizer.json path> --dp 8

# benchmark client
python long_prompt_multi_turn.py --port 30000  --tokenizer <tokenizer path>

We can see a clear improvement over the original method due to the high cache hit rate.

[benchmark result images]

Reference: Google Sheet

Follow-up

There are still many follow-ups, notably:

  1. Python binding CI publishing setup (sglang-router PyPI)
  2. Various optimizations in Rust (marked as TODO in the code)
  3. Implementation of the correction phase and Variance-Based Load Balancing

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu force-pushed the byhsu/approx-tree branch 2 times, most recently from a9203e3 to 7233496 on November 7, 2024

Commits: "faster than --dp", "wip", "wip", "wip"

@ByronHsu changed the title from "[WIP] Byhsu/approx tree" to "[rust] Approximate Tree Initial Version" on Nov 8, 2024
@ByronHsu changed the title from "[rust] Approximate Tree Initial Version" to "[rust] cache-aware DP - approx tree" on Nov 8, 2024
ApproxTree {
    worker_urls: Vec<String>,
    // TODO: don't lock the whole tree
    url_to_tree: Arc<Mutex<HashMap<String, RadixTree>>>,


DashMap may help here.

ByronHsu (Collaborator, Author) replied:


Actually, I want to lock on the nodes of the radix tree instead of the whole hashmap. Do you have any suggestions for that?


I think we can start from the simple one (lock the whole map). A concurrent map is quite complex. We may try out this comment later on if you are interested:
https://users.rust-lang.org/t/locking-only-one-entry-in-hashmap/84764/4
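One way to sketch the per-entry locking idea from that thread: an outer `RwLock` guards the map's shape while each tree carries its own `Mutex`, so updates to different workers' trees don't serialize on a single lock. `RadixTree` is stubbed out and `update_tree` is an invented helper, not the PR's code.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Stub standing in for the real radix tree.
#[derive(Default)]
struct RadixTree {
    size: usize,
}

// RwLock protects map membership; each tree has its own Mutex.
type TreeMap = RwLock<HashMap<String, Arc<Mutex<RadixTree>>>>;

fn update_tree(map: &TreeMap, url: &str) {
    // Clone the Arc under a short read lock, then release the map
    // before touching the tree, so other workers' trees stay free.
    let tree = map.read().unwrap().get(url).cloned();
    if let Some(tree) = tree {
        // Only this worker's tree is locked during the update.
        tree.lock().unwrap().size += 1;
    }
}

fn main() {
    let map: TreeMap = RwLock::new(HashMap::new());
    map.write().unwrap().insert("http://w0".into(), Arc::default());
    update_tree(&map, "http://w0");
    assert_eq!(map.read().unwrap()["http://w0"].lock().unwrap().size, 1);
    println!("ok");
}
```

A crate like DashMap packages a similar shard-level locking pattern behind a plain map API.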

},
Random {
    worker_urls: Vec<String>,
},
ApproxTree {
    worker_urls: Vec<String>,

@jayzhan211 commented Nov 8, 2024:


If worker_urls is mostly read-only, instead of owned Strings you can use Arc<str> for thread-safe sharing. Vec<Arc<str>> is probably suitable in this case.

@ByronHsu (Collaborator, Author) replied Nov 8, 2024:


For now the Vec is read-only, but once we add dynamic scaling it will be read+write, though mostly read. Vec<Arc<str>> looks great.
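For illustration, a tiny sketch of why `Vec<Arc<str>>` is cheap to share: cloning an `Arc<str>` bumps a reference count instead of copying the string data, which suits a mostly-read worker list shared across threads.

```rust
use std::sync::Arc;

fn main() {
    let urls: Vec<Arc<str>> = vec![Arc::from("http://w0"), Arc::from("http://w1")];
    // Refcount bump, no allocation or byte copy.
    let cheap_copy = urls[0].clone();
    assert_eq!(&*cheap_copy, "http://w0");
    assert_eq!(Arc::strong_count(&urls[0]), 2);
    println!("ok");
}
```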

4 participants