Merged
Commits
34 commits
151f27e
us sm copy flags
Autumn1998 Nov 18, 2025
4d4cc12
Use ~/.deepep/hybrid_ep/jit as the jit path
Nov 24, 2025
17313f8
[BIG CHANGE] refactor permute, group 2 syncs into 1
Autumn1998 Nov 24, 2025
c594d90
Add topo detection
Autumn1998 Nov 24, 2025
c77e40c
[TO FIX] Update hybrid ep version
Autumn1998 Nov 24, 2025
7541f65
fix bug in scaling factor
Autumn1998 Nov 24, 2025
25da0ca
Jit different kernel in inter-node/intra-node case
Autumn1998 Nov 25, 2025
b3359fd
[TO FIX] dynamic seq len for RDMA
Autumn1998 Nov 25, 2025
3fd405d
disable RDMA dynamic token temporary
Autumn1998 Nov 25, 2025
4ea5679
move sync to the point before dispatch
Autumn1998 Nov 25, 2025
d227353
Fix the bug: Use MAX_SEQ_LEN to jit compile kerne
Autumn1998 Nov 26, 2025
ff09d91
Make permute a persistant kernel
Autumn1998 Nov 26, 2025
0699795
add lost file of roca support
Autumn1998 Nov 26, 2025
3211a59
Add handle on the torch API test
Autumn1998 Nov 26, 2025
ef88f80
Release of assert of hidden%512==0 at bf16 case
Autumn1998 Nov 26, 2025
87c2414
Fix bug of hang when: seq_len < chunk_size * num_of_SMs
Autumn1998 Nov 27, 2025
a8c98b7
fix doc
Autumn1998 Nov 27, 2025
699608a
use 8 SM on RDMA
Autumn1998 Nov 27, 2025
da5bbc9
minor fix
Autumn1998 Nov 27, 2025
406399d
fix bug on updating config.max_num_of_tokens_per_rank
Autumn1998 Nov 27, 2025
098da2a
rm num_dispatch_tokens
Autumn1998 Nov 27, 2025
1d38eb0
minor fix on test
Autumn1998 Nov 27, 2025
0848ae0
update name of flag
Autumn1998 Nov 28, 2025
489c80b
update doc
Autumn1998 Nov 28, 2025
99d7e5e
update doc
Autumn1998 Nov 28, 2025
f7adadf
updata dog, fix bug on RDMA install
Autumn1998 Nov 28, 2025
40f3148
fix review comments
Autumn1998 Nov 28, 2025
c86166e
use more blocks in permute
Autumn1998 Nov 28, 2025
7593620
optimize permute kernel
Autumn1998 Dec 1, 2025
27f5947
update assertion message
Autumn1998 Dec 1, 2025
8511e51
reduce default stages to fit EP64
Autumn1998 Dec 1, 2025
ab95a84
enable rdma on mnnvl
Autumn1998 Dec 2, 2025
7c0a35f
rm the limition of hidden%256 == 0 in bf16 case
Autumn1998 Dec 2, 2025
5ad9d2f
revert topk in test
Autumn1998 Dec 3, 2025
43 changes: 35 additions & 8 deletions Hybrid-EP_Implementation.md
@@ -192,23 +192,51 @@ export RDMA_CORE_HOME=/path/to/rdma-core # Path to your RDMA core installation
export TORCH_ARCH_LIST="9.0;10.0" # Adjust based on your GPU architecture
pip install .
```

> RDMA Core requirement: install `rdma-core` v60.0 ([reference](https://github.com/linux-rdma/rdma-core/tree/v60.0)); the latest release is also recommended ([linux-rdma/rdma-core](https://github.com/linux-rdma/rdma-core.git)).

Example:
```bash
git clone https://github.com/linux-rdma/rdma-core.git
cd rdma-core
git checkout tags/v60.0
sh build.sh
export RDMA_CORE_HOME=/path/to/rdma-core/build
```

### Quick Start
Hybrid EP’s RDMA topology probing relies on `libnvidia-ml.so.1`. During Dockerfile builds, compile against the NVML stubs (for example, those shipped in `libnvidia-ml-dev`), then at runtime launch the container with `--gpus all` or a Kubernetes device plugin so that the NVIDIA container runtime injects the host’s real NVML library and prevents driver/library mismatches.

> **⚠️ Important Note for RDMA Inter-node Configuration**
> Currently, the RDMA inter-node kernel implementation requires manual specification of NIC names for each GPU. You need to provide the mapping between GPUs and their corresponding IB device names via the `--ib-dev-name-list` parameter; see the sketch after this note and `tests/test_hybrid_ep.py` for detailed usage examples.
> In addition, when using the RDMA path, all communications after initialization must use the same num-tokens-per-rank value that was set during initialization. Dynamic sequence length is not currently supported.
>
> **Automatic topology detection will be supported soon.**
> **Dynamic sequence length will be supported soon.**
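
The sketch below illustrates one way such a launch might look; the `torchrun` invocation, the node/process counts, and the comma-separated value format of `--ib-dev-name-list` are assumptions for illustration only, so consult the argument parsing in `tests/test_hybrid_ep.py` for the exact interface.

```bash
# Hypothetical inter-node launch on 2 nodes x 8 GPUs. The launcher and the comma-separated
# format of --ib-dev-name-list are assumptions, not the verified CLI of the test script.
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    tests/test_hybrid_ep.py \
    --ib-dev-name-list mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
```
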
Dockerfile example:
```dockerfile
RUN apt-get update && \
    apt-get install -y --no-install-recommends libnvidia-ml-dev
RUN git clone -b hybrid_ep https://github.com/deepseek-ai/DeepEP.git
ENV HYBRID_EP_MULTINODE=1
RUN cd DeepEP && \
    TORCH_CUDA_ARCH_LIST="9.0 10.0" MAX_JOBS=8 pip install --no-build-isolation . && \
    apt-get purge -y libnvidia-ml-dev && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
```

### Quick Start

Refer to `tests/test_hybrid_ep.py` for comprehensive usage examples including:
- Multi-node configuration
- Intra-node testing scenarios
- Inter-node testing scenarios
- Performance benchmarking setups

**Explicitly configure `num_of_hybrid_ep_ranks_per_nvlink_domain` (default 8; the number of Hybrid-EP ranks that participate in the same Hybrid-EP communication within a single NVLink domain, which is critical in the MNNVL case) and `USE_MNNVL` (default disabled/False), either via uppercase environment variables or by passing arguments to `HybridEPBuffer.__init__`. In multi-node NVLink deployments you must set `USE_MNNVL=1`.**

Example configuration for EP64 with MNNVL:
- Environment variables:
```bash
export NUM_OF_HYBRID_EP_RANKS_PER_NVLINK_DOMAIN=64
export USE_MNNVL=1
```
- Python init: `HybridEPBuffer(..., num_of_hybrid_ep_ranks_per_nvlink_domain=64, use_mnnvl=True)`

### Important Configuration Note
The important parameter settings live in `csrc/hybrid_ep/config.cuh`. You can modify these parameters via `HybridEPBuffer.init_config()` or by setting the corresponding environment variables (see `deep_ep/hybrid_ep_buffer.py`) to achieve better performance/usability:

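
As a rough illustration of the environment-variable path, a minimal sketch is shown below. The variable name is an assumption, derived from the `config.max_num_of_tokens_per_rank` field referenced in this PR's commits and the uppercase naming convention used above; verify the variables actually read in `deep_ep/hybrid_ep_buffer.py`.

```bash
# Hypothetical override of a single config field before launching the job. The name is the
# assumed uppercase form of config.max_num_of_tokens_per_rank; check deep_ep/hybrid_ep_buffer.py
# for the environment variables the buffer actually reads.
export MAX_NUM_OF_TOKENS_PER_RANK=4096
```
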
@@ -264,7 +292,6 @@ Here are important parameter settings in `csrc/hybrid_ep/config.cuh`. You can mo
- Comprehensive performance improvements

### 🚧 Upcoming Features
- **Automatic Topology Detection**: Automatic detection of GPU-NIC mapping for RDMA inter-node communication, eliminating the need for manual `--ib-dev-name-list` configuration
- **Low Latency Mode**: Enhanced performance for latency-critical workloads
- Performance optimization

38 changes: 38 additions & 0 deletions csrc/hybrid_ep/backend/NCCL_LICENSE.txt
@@ -0,0 +1,38 @@
Copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION, Lawrence Berkeley National
Laboratory, the U.S. Department of Energy, nor the names of their
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The U.S. Department of Energy funded the development of this software
under subcontract 7078610 with Lawrence Berkeley National Laboratory.


This code also includes files from the NVIDIA Tools Extension SDK project.

See:

https://github.com/NVIDIA/NVTX

for more information and license details.