Merged
Commits
34 commits
151f27e
us sm copy flags
Autumn1998 Nov 18, 2025
4d4cc12
Use ~/.deepep/hybrid_ep/jit as the jit path
Nov 24, 2025
17313f8
[BIG CHANGE] refactor permute, group 2 syncs into 1
Autumn1998 Nov 24, 2025
c594d90
Add topo detection
Autumn1998 Nov 24, 2025
c77e40c
[TO FIX] Update hybrid ep version
Autumn1998 Nov 24, 2025
7541f65
fix bug in scaling factor
Autumn1998 Nov 24, 2025
25da0ca
Jit different kernel in inter-node/intra-node case
Autumn1998 Nov 25, 2025
b3359fd
[TO FIX] dynamic seq len for RDMA
Autumn1998 Nov 25, 2025
3fd405d
disable RDMA dynamic token temporary
Autumn1998 Nov 25, 2025
4ea5679
move sync to the point before dispatch
Autumn1998 Nov 25, 2025
d227353
Fix the bug: Use MAX_SEQ_LEN to jit compile kerne
Autumn1998 Nov 26, 2025
ff09d91
Make permute a persistant kernel
Autumn1998 Nov 26, 2025
0699795
add lost file of roca support
Autumn1998 Nov 26, 2025
3211a59
Add handle on the torch API test
Autumn1998 Nov 26, 2025
ef88f80
Release of assert of hidden%512==0 at bf16 case
Autumn1998 Nov 26, 2025
87c2414
Fix bug of hang when: seq_len < chunk_size * num_of_SMs
Autumn1998 Nov 27, 2025
a8c98b7
fix doc
Autumn1998 Nov 27, 2025
699608a
use 8 SM on RDMA
Autumn1998 Nov 27, 2025
da5bbc9
minor fix
Autumn1998 Nov 27, 2025
406399d
fix bug on updating config.max_num_of_tokens_per_rank
Autumn1998 Nov 27, 2025
098da2a
rm num_dispatch_tokens
Autumn1998 Nov 27, 2025
1d38eb0
minor fix on test
Autumn1998 Nov 27, 2025
0848ae0
update name of flag
Autumn1998 Nov 28, 2025
489c80b
update doc
Autumn1998 Nov 28, 2025
99d7e5e
update doc
Autumn1998 Nov 28, 2025
f7adadf
updata dog, fix bug on RDMA install
Autumn1998 Nov 28, 2025
40f3148
fix review comments
Autumn1998 Nov 28, 2025
c86166e
use more blocks in permute
Autumn1998 Nov 28, 2025
7593620
optimize permute kernel
Autumn1998 Dec 1, 2025
27f5947
update assertion message
Autumn1998 Dec 1, 2025
8511e51
reduce default stages to fit EP64
Autumn1998 Dec 1, 2025
ab95a84
enable rdma on mnnvl
Autumn1998 Dec 2, 2025
7c0a35f
rm the limition of hidden%256 == 0 in bf16 case
Autumn1998 Dec 2, 2025
5ad9d2f
revert topk in test
Autumn1998 Dec 3, 2025
43 changes: 35 additions & 8 deletions Hybrid-EP_Implementation.md
@@ -192,23 +192,51 @@ export RDMA_CORE_HOME=/path/to/rdma-core # Path to your RDMA core installation
export TORCH_ARCH_LIST="9.0;10.0" # Adjust based on your GPU architecture
pip install .
```

> RDMA Core requirement: install `rdma-core` v60.0 ([reference](https://github.com/linux-rdma/rdma-core/tree/v60.0)); the latest release is also recommended ([linux-rdma/rdma-core](https://github.com/linux-rdma/rdma-core.git)).

Example:
```bash
git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
git checkout tags/v60.0
sh build.sh
export RDMA_CORE_HOME=/path/to/rdma-core/build
```

### Quick Start
Hybrid EP’s RDMA topology probing relies on `libnvidia-ml.so.1`. During Dockerfile builds, compile against the NVML stubs (for example, those shipped in `libnvidia-ml-dev`), then at runtime launch the container with `--gpus all` or a Kubernetes device plugin so that the NVIDIA container runtime injects the host’s real NVML library and prevents driver/library mismatches.

> **⚠️ Important Note for RDMA Inter-node Configuration**
> Currently, the RDMA inter-node kernel implementation requires manual specification of NIC names for each GPU. You need to provide the mapping between GPUs and their corresponding IB device names via the `--ib-dev-name-list` parameter; see the sketch after this note and `tests/test_hybrid_ep.py` for detailed usage examples.
> In addition, when using the RDMA path, all communications after initialization must use the same num-tokens-per-rank value that was set during initialization. Dynamic sequence length is not currently supported.
>
> **Automatic topology detection will be supported soon.**
> **Dynamic sequence length will be supported soon.**
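
The sketch below illustrates one way such a launch might look; the `torchrun` invocation, the node/process counts, and the comma-separated value format of `--ib-dev-name-list` are assumptions for illustration only, so consult the argument parsing in `tests/test_hybrid_ep.py` for the exact interface.

```bash
# Hypothetical inter-node launch on 2 nodes x 8 GPUs. The launcher and the comma-separated
# format of --ib-dev-name-list are assumptions, not the verified CLI of the test script.
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    tests/test_hybrid_ep.py \
    --ib-dev-name-list mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
```
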
Dockerfile example:
```dockerfile
RUN apt-get update && \
    apt-get install -y --no-install-recommends libnvidia-ml-dev
RUN git clone -b hybrid_ep https://github.com/deepseek-ai/DeepEP.git
ENV HYBRID_EP_MULTINODE=1
RUN cd DeepEP && \
    TORCH_CUDA_ARCH_LIST="9.0 10.0" MAX_JOBS=8 pip install --no-build-isolation . && \
    apt-get purge -y libnvidia-ml-dev && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
```

### Quick Start

Refer to `tests/test_hybrid_ep.py` for comprehensive usage examples including:
- Multi-node configuration
- Intra-node testing scenarios
- Inter-node testing scenarios
- Performance benchmarking setups

**Explicitly configure `num_of_hybrid_ep_ranks_per_nvlink_domain` (default 8; the number of Hybrid-EP ranks that participate in the same Hybrid-EP communication within a single NVLink domain, which is critical in the MNNVL case) and `USE_MNNVL` (default disabled/False), either via uppercase environment variables or by passing arguments to `HybridEPBuffer.__init__`. In multi-node NVLink deployments you must set `USE_MNNVL=1`.**

Example configuration for EP64 with MNNVL:
- Environment variables:
```bash
export NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN=64
export USE_MNNVL=1
```
- Python init: `HybridEPBuffer(..., num_of_hybrid_ep_ranks_per_nvlink_domain=64, use_mnnvl=True)`

### Important Configuration Note
The important parameter settings live in `csrc/hybrid_ep/config.cuh`. You can modify these parameters via `HybridEPBuffer.init_config()` or by setting the corresponding environment variables (see `deep_ep/hybrid_ep_buffer.py`) to achieve better performance/usability:

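
As a rough illustration of the environment-variable path, a minimal sketch is shown below. The variable name is an assumption, derived from the `config.max_num_of_tokens_per_rank` field referenced in this PR's commits and the uppercase naming convention used above; verify the variables actually read in `deep_ep/hybrid_ep_buffer.py`.

```bash
# Hypothetical override of a single config field before launching the job. The name is the
# assumed uppercase form of config.max_num_of_tokens_per_rank; check deep_ep/hybrid_ep_buffer.py
# for the environment variables the buffer actually reads.
export MAX_NUM_OF_TOKENS_PER_RANK=4096
```
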
@@ -264,7 +292,6 @@ Here are important parameter settings in `csrc/hybrid_ep/config.cuh`. You can mo
- Comprehensive performance improvements

### 🚧 Upcoming Features
- **Automatic Topology Detection**: Automatic detection of GPU-NIC mapping for RDMA inter-node communication, eliminating the need for manual `--ib-dev-name-list` configuration
- **Low Latency Mode**: Enhanced performance for latency-critical workloads
- Performance optimization

38 changes: 38 additions & 0 deletions csrc/hybrid_ep/backend/NCCL_LICENSE.txt
@@ -0,0 +1,38 @@
Copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION, Lawrence Berkeley National
Laboratory, the U.S. Department of Energy, nor the names of their
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The U.S. Department of Energy funded the development of this software
under subcontract 7078610 with Lawrence Berkeley National Laboratory.


This code also includes files from the NVIDIA Tools Extension SDK project.

See:

https://github.com/NVIDIA/NVTX

for more information and license details.