support context parallel #3951
Open

irexyc wants to merge 31 commits into InternLM:main from irexyc:cp
Commits (31)
c1dae3a use driver flag (irexyc)
bb27b62 update (irexyc)
0fe88bc accurate mask iter (irexyc)
5c02779 use fast divmod (irexyc)
53654ad remove cp_O (irexyc)
e3dd4f7 remove unused (irexyc)
1f75dd6 return the last token's logprobs if include_stop_str_in_output is req… (lvhan028)
be504d3 [Fix] device args in chat cli when using pytorch engine (#3999) (CyCle1024)
25a8fb8 Merge remote-tracking branch 'origin/main' into cp2 (irexyc)
77ef52a fix NULL raw data (irexyc)
29cf813 add attn_cp_size to cli (irexyc)
0044d4f build cutlass::FastDivmod on host (irexyc)
e4050a4 use single buffer (irexyc)
f44ef96 udpate comm (irexyc)
a329b29 use two stage reduce (irexyc)
dafcd64 Merge remote-tracking branch 'github/main' into cp2 (irexyc)
c9649c0 remove unused (irexyc)
52766d2 better AllreduceResidualRMSnorm (irexyc)
b783d5c fix max_session_len (irexyc)
c39373a Merge remote-tracking branch 'github/main' into cp (irexyc)
47a349b update docs (irexyc)
d83a2c7 fix embedding/lm_head split (irexyc)
c7e1e23 use same split_k on different cp_rank (irexyc)
8c5b289 always use seperate reduce for cp (irexyc)
4005547 add cp configuration parameter (irexyc)
1d2b098 remove redundant parameters (irexyc)
77920f8 remove redundant parameters (irexyc)
f54ca43 fix build (irexyc)
1ac3080 fix xgrammar build (irexyc)
7872225 update docs (irexyc)
0f82ef1 remove unused (irexyc)
# Context Parallel

When the memory on a single GPU is insufficient to deploy a model, it is often deployed with tensor parallelism (TP), which generally requires `num_key_value_heads` to be divisible by `TP`. To deploy with `TP > num_key_value_heads`, the kv-heads must be duplicated to satisfy the divisibility requirement. However, this has two disadvantages:

1. The amount of available kv_cache is halved, which reduces the maximum supported session length.
2. The maximum inference batch size is reduced, leading to lower throughput.

To address this issue, the TurboMind inference backend supports setting `attn_dp_size`, which avoids duplicating kv-heads but introduces data imbalance. To eliminate this imbalance, TurboMind supports sequence parallelism, which stores the kv_cache interleaved across the different cp_ranks. See the example below:

```
cp_size=2, prompt_len=5, generation_len=4
kv_cache stored on cp_rank0: 0, 2, 4, 6, 8
kv_cache stored on cp_rank1: 1, 3, 5, 7
```

## Usage

Taking Intern-S1 / Qwen3-235B-A22B as an example, their `num_key_value_heads` is 4. To deploy with `TP=8` while avoiding duplication of the kv_cache, you can deploy in the following way:

```
lmdeploy serve api_server internlm/Intern-S1 --tp 8 --cp 2

lmdeploy serve api_server Qwen/Qwen3-235B-A22B --tp 8 --cp 2
```
# Sequence Parallelism

When a single GPU's memory is insufficient to deploy a model, the model is usually deployed with `TP`, which generally requires `num_key_value_heads` to be divisible by `TP`. To deploy with `TP > num_key_value_heads`, copies of the kv-heads must be created to satisfy the divisibility requirement. This has two drawbacks:

1. The amount of available kvcache is halved, which reduces the maximum inference length per request.
2. The maximum inference batch size is reduced, lowering throughput.

To solve this problem, the TurboMind inference backend supports setting `attn_dp_size`, which avoids creating copies of the kv-heads but introduces data imbalance. To eliminate this imbalance, TurboMind supports sequence parallelism, storing the kv_cache interleaved across the different cp_ranks, for example:

```
cp_size=2, prompt_len=5, generation_len=4
kv_cache stored on cp_rank0: 0, 2, 4, 6, 8
kv_cache stored on cp_rank1: 1, 3, 5, 7
```

## Usage

Taking `Intern-S1` / `Qwen3-235B-A22B` as an example, their `num_key_value_heads` is 4. To deploy with `TP=8` while avoiding duplication of the kv_cache, you can deploy as follows:

```
lmdeploy serve api_server internlm/Intern-S1 --tp 8 --cp 2

lmdeploy serve api_server Qwen/Qwen3-235B-A22B --tp 8 --cp 2
```