4 changes: 2 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -15,8 +15,8 @@ Include any tests here.
- [ ] Manual testing

## Checklist
- [ ] My code follows the style guidelines, e.g. `format.sh`.
- [ ] I have run `build_and_install.sh` to verify compilation.
- [ ] I have run `format.sh` to follow the style guidelines.
- [ ] I have run `build.sh` to verify compilation.
- [ ] I have removed redundant variables and comments.
- [ ] I have updated the documentation.
- [ ] I have added tests.
6 changes: 3 additions & 3 deletions .github/workflows/uccl-build-test.yml
@@ -75,7 +75,7 @@ jobs:
conda activate uccl

cd /home/skytestuser/uccl-test
./build_and_install.sh cuda all 3.11 2>&1 | tee build.log
./build.sh cuda all 3.11 --install 2>&1 | tee build.log

grep -q \"Successfully installed uccl-0.0.1.post4\" build.log
"'
@@ -170,8 +170,8 @@ jobs:
exit 1
fi

if ! python -c 'import torch; import uccl.ep'; then
echo 'Import of torch and uccl.ep failed. Cleaning up and exiting...'
if ! python -c 'import torch; import uccl_ep'; then
echo 'Import of torch and uccl_ep failed. Cleaning up and exiting...'
python setup.py clean
exit 1
fi
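The updated import gate can be reproduced outside CI. A minimal sketch of the pattern — the helper name `check_import` is illustrative, and `uccl_ep` is the module name this PR switches to:

```shell
#!/usr/bin/env bash
# Illustrative helper mirroring the CI gate: run a Python import and fail fast.
check_import() {
  local stmt="$1"
  if ! python3 -c "$stmt" 2>/dev/null; then
    echo "Import failed: $stmt"
    return 1
  fi
  echo "Import OK: $stmt"
}

# In the workflow this would be: check_import 'import torch; import uccl_ep'
check_import 'import json'
```

In the workflow, a failed import additionally triggers `python setup.py clean` before exiting.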
52 changes: 8 additions & 44 deletions README.md
@@ -93,12 +93,17 @@ The easiest way to use UCCL is to first build based on your platform. The build

```bash
git clone https://github.com/uccl-project/uccl.git --recursive && cd uccl
bash build_and_install.sh [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p|ep] [py_version] [rocm_index_url]
# Eg, bash build_and_install.sh cuda ep

# For collective and p2p: eg, bash build.sh cuda ccl_rdma --install
bash build.sh [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p] [py_version] [rocm_index_url] --install

# For ep:
cd ep && bash build.sh [cuda|rocm] [py_version] --install
```
> Note:
> - When building for ROCm with Python packaging through TheRock, specify your ROCm index URL; the default is `https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu`, which may not be what you want. When installing UCCL wheels for TheRock, pass pip the index URL and add the optional `[rocm]` extra to the wheel, e.g., `pip install --extra-index-url https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu wheelhouse-therock/uccl-0.0.1.post4-py3-none-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl[rocm]`.
> - You can build with different CUDA or ROCm versions by specifying tags such as `cuda13` or `rocm6`. The defaults are CUDA 12.x for the `cuda` tag and ROCm 7.x for the `rocm` tag.
> - Check [docs/wheel_build.md](./docs/wheel_build.md) for details.

Then, when running your PyTorch applications, set the environment variable accordingly:
```bash
@@ -127,8 +132,6 @@ export UCCL_HOME=$(pwd)/uccl
```

To build UCCL for development, you need to install some common dependencies:
<details><summary>Click me</summary>

```bash
# Note if you are using docker+wheel build, there is no need to install the following dependencies.
sudo apt update
@@ -149,7 +152,6 @@ pip install paramiko pybind11
# Upgrade conda glibc to a modern version
conda install -c conda-forge "libstdcxx-ng>=12" "libgcc-ng>=12"
```
</details>

For quick installation with docker, you can directly dive into:
* [`UCCL-Collective RDMA`](collective/rdma/README.md): Collectives for Nvidia/AMD GPUs + IB/RoCE RDMA NICs (currently support Nvidia and Broadcom NICs)
@@ -160,44 +162,6 @@ For quick installation with docker, you can directly dive into:
* [`UCCL-P2P`](p2p/README.md): P2P for RDMA NICs and GPU IPCs (currently support Nvidia/AMD GPUs and Nvidia/Broadcom NICs)
* [`UCCL-EP`](ep/README.md): EP for MoE training and inference with DeepEP-compatible APIs (currently support Nvidia/AMD GPUs and Nvidia/Broadcom/EFA NICs)

### Python Wheel Build

Run the following to build Python wheels:
```bash
cd $UCCL_HOME
./build.sh [cuda|rocm|therock] [all|rdma|p2p|efa|ep] [py_version] [rocm_index_url]
```

Run the following to install the wheels locally:
```bash
cd $UCCL_HOME
pip install wheelhouse-[cuda/rocm]/uccl-*.whl
```

The cross-compilation matrix is as follows:

| Platform/Feature | rdma-cuda | rdma-rocm | rdma-arm | p2p-cuda | p2p-rocm | p2p-arm | efa |
|--------------------|-----------|-----------|----------|----------|----------|---------|-----|
| cuda + x86 | ✓ | ✓ | x | ✓ | ✓ | x | ✓ |
| cuda + arm (gh200) | ✓ | x | x | ✓ | x | x | x |
| rocm + x86 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | x |
| aws p4d/p4de | ✓ | ✓ | x | ✓ | x | x | ✓ |

Note that you need ARM hosts to build ARM wheels, as cross-compilation tool `qemu-user-static` cannot emulate CUDA or ROCm.

### On Cloudlab CPU Machines

If you want to build nccl and nccl-tests on cloudlab ubuntu22, you need to install cuda and openmpi:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo apt install ./cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit -y
sudo apt install nvidia-driver-550 nvidia-utils-550 -y
sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev -y
```

</details>

## Citation
@@ -212,7 +176,7 @@ The code in this repository is mostly described in the papers below. Please cons
}
```
```bibtex
@article{mao2025uccl,
@article{uccl_ep,
title={UCCL-EP: Portable Expert-Parallel Communication},
author={Mao, Ziming and Zhang, Yihan and Cui, Chihan and You, Kaichao and Chen, Zhongjie and Xu, Zhiying and Shenker, Scott and Raiciu, Costin and Zhou, Yang and Stoica, Ion},
journal={arXiv preprint arXiv:2512.19849},
88 changes: 50 additions & 38 deletions build.sh
@@ -7,7 +7,7 @@ set -e
# a purpose-built Docker/Podman image derived from Ubuntu 22.04.
#
# Usage:
# ./build.sh [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p|ep] [py_version] [rocm_index_url] [therock_base_image]
# ./build.sh [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p] [py_version] [rocm_index_url] [therock_base_image] [--install]
#
# Environment Variables:
# CONTAINER_ENGINE=podman Use podman instead of docker.
@@ -18,9 +18,19 @@ set -e
# The wheels are written to wheelhouse-[cuda|rocm|therock]
# -----------------------

TARGET=${1:-cuda}
BUILD_TYPE=${2:-all}
PY_VER=${3:-$(python3 -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")}
# Parse arguments: positional args + --install flag
DO_INSTALL=0
POSITIONAL_ARGS=()
for arg in "$@"; do
case "$arg" in
--install) DO_INSTALL=1 ;;
*) POSITIONAL_ARGS+=("$arg") ;;
esac
done

TARGET=${POSITIONAL_ARGS[0]:-cuda}
BUILD_TYPE=${POSITIONAL_ARGS[1]:-all}
PY_VER=${POSITIONAL_ARGS[2]:-$(python3 -c "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')")}
ARCH="$(uname -m)"

# Container engine: "docker" (default) or "podman"
@@ -30,14 +40,14 @@ if [[ "$CONTAINER_ENGINE" != "docker" && "$CONTAINER_ENGINE" != "podman" ]]; the
exit 1
fi
# The default for ROCM_IDX_URL depends on the gfx architecture of your GPU and the index URLs may change.
ROCM_IDX_URL=${4:-https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu}
ROCM_IDX_URL=${POSITIONAL_ARGS[3]:-https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu}
# The default for THEROCK_BASE_IMAGE is current, but may change. Make sure to track TheRock's dockerfile.
THEROCK_BASE_IMAGE=${5:-quay.io/pypa/manylinux_2_28_x86_64@sha256:d632b5e68ab39e59e128dcf0e59e438b26f122d7f2d45f3eea69ffd2877ab017}
THEROCK_BASE_IMAGE=${POSITIONAL_ARGS[4]:-quay.io/pypa/manylinux_2_28_x86_64@sha256:d632b5e68ab39e59e128dcf0e59e438b26f122d7f2d45f3eea69ffd2877ab017}
IS_EFA=$( [ -d "/sys/class/infiniband/" ] && ls /sys/class/infiniband/ 2>/dev/null | grep -q rdmap && echo "EFA support: true" ) || echo "EFA support: false"


if [[ $TARGET != cuda* && $TARGET != rocm* && $TARGET != "therock" ]]; then
echo "Usage: $0 [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p|ep] [py_version] [rocm_index_url] [therock_base_image]" >&2
echo "Usage: $0 [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p] [py_version] [rocm_index_url] [therock_base_image] [--install]" >&2
exit 1
fi

@@ -175,32 +185,6 @@ build_p2p() {
fi
}

build_ep() {
local TARGET="$1"
local ARCH="$2"
local IS_EFA="$3"

set -euo pipefail
echo "[container] build_ep Target: $TARGET"

if [[ "${USE_INTEL_RDMA_NIC:-0}" == "1" ]]; then
echo "[container] Building EP with Intel RDMA NIC support (USE_INTEL_RDMA_NIC=1)"
fi

if [[ "$TARGET" == "therock" ]]; then
echo "Skipping GPU-driven build on therock (no GPU-driven support yet)."
elif [[ "$TARGET" == rocm* || "$TARGET" == cuda* ]]; then
cd ep
# This may be needed if you traverse through different git commits
# make clean && rm -r build || true
USE_INTEL_RDMA_NIC=${USE_INTEL_RDMA_NIC:-0} python3 setup.py build
cd ..
echo "[container] Copying GPU-driven .so to uccl/"
mkdir -p uccl/lib
cp ep/build/**/*.so uccl/
fi
}

build_ukernel() {
local TARGET="$1"
local ARCH="$2"
@@ -303,7 +287,7 @@ echo "[2/3] Running build inside container..."

# Auto-detect CUDA architecture for ep build
DETECTED_GPU_ARCH=""
if [[ "$BUILD_TYPE" =~ (ep|all|p2p) ]];then
if [[ "$BUILD_TYPE" =~ (all|p2p) ]];then
if [[ "$TARGET" == cuda* ]] && command -v nvidia-smi &> /dev/null; then
DETECTED_GPU_ARCH="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1 | tr -d ' ' || true)"

@@ -387,7 +371,7 @@ ${CONTAINER_ENGINE} "${CONTAINER_RUN_ARGS[@]}" \
-e TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST:-}" \
-e DISABLE_AGGRESSIVE_ATOMIC="${DISABLE_AGGRESSIVE_ATOMIC:-0}" \
-e UCCL_WHEEL_PLAT="${UCCL_WHEEL_PLAT:-}" \
-e FUNCTION_DEF="$(declare -f build_rccl_nccl_h build_ccl_rdma build_ccl_efa build_p2p build_ep build_ukernel)" \
-e FUNCTION_DEF="$(declare -f build_rccl_nccl_h build_ccl_rdma build_ccl_efa build_p2p build_ukernel)" \
-w /io \
"$IMAGE_NAME" /bin/bash -c '
set -euo pipefail
@@ -417,15 +401,12 @@ ${CONTAINER_ENGINE} "${CONTAINER_RUN_ARGS[@]}" \
build_ccl_efa "$TARGET" "$ARCH" "$IS_EFA"
elif [[ "$BUILD_TYPE" == "p2p" ]]; then
build_p2p "$TARGET" "$ARCH" "$IS_EFA"
elif [[ "$BUILD_TYPE" == "ep" ]]; then
build_ep "$TARGET" "$ARCH" "$IS_EFA"
elif [[ "$BUILD_TYPE" == "ukernel" ]]; then
build_ukernel "$TARGET" "$ARCH" "$IS_EFA"
elif [[ "$BUILD_TYPE" == "all" ]]; then
build_ccl_rdma "$TARGET" "$ARCH" "$IS_EFA"
build_ccl_efa "$TARGET" "$ARCH" "$IS_EFA"
build_p2p "$TARGET" "$ARCH" "$IS_EFA"
# build_ep "$TARGET" "$ARCH" "$IS_EFA"
# build_ukernel "$TARGET" "$ARCH" "$IS_EFA"
fi

@@ -527,3 +508,34 @@ def initialize():
# 3. Done
echo "[3/3] Wheel built successfully (stored in ${WHEEL_DIR}):"
ls -lh "${WHEEL_DIR}"/uccl-*.whl || true

# 4. Optionally install the built wheel
if [[ "$DO_INSTALL" == "1" ]]; then
# Auto-detect uv vs pip
if command -v uv &> /dev/null && [[ -n "${VIRTUAL_ENV:-}" ]]; then
PIP_CMD="uv pip"
else
PIP_CMD="pip"
fi
echo "[4/4] Installing uccl wheel (using ${PIP_CMD})..."
${PIP_CMD} install -r requirements.txt 2>/dev/null || true
${PIP_CMD} uninstall uccl -y 2>/dev/null || true
if [[ "$TARGET" != "therock" ]]; then
${PIP_CMD} install "${WHEEL_DIR}"/uccl-*.whl --no-deps
else
${PIP_CMD} install --extra-index-url "${ROCM_IDX_URL}" "$(ls "${WHEEL_DIR}"/uccl-*.whl)[rocm]"
fi

UCCL_INSTALL_PATH=$(${PIP_CMD} show uccl 2>/dev/null | grep "^Location:" | cut -d' ' -f2 || echo "")
if [[ -n "$UCCL_INSTALL_PATH" && -d "$UCCL_INSTALL_PATH" ]]; then
UCCL_PACKAGE_PATH="$UCCL_INSTALL_PATH/uccl"
if [[ -d "$UCCL_PACKAGE_PATH" ]]; then
echo "UCCL installed at: $UCCL_PACKAGE_PATH"
echo "Set LIBRARY_PATH: export LIBRARY_PATH=\"$UCCL_PACKAGE_PATH/lib:\$LIBRARY_PATH\""
else
echo "UCCL package directory not found at: $UCCL_PACKAGE_PATH"
fi
else
echo "Warning: Could not detect UCCL installation path"
fi
fi
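The argument handling introduced above can be sketched in isolation. This is not the full script, only the parse: `parse_build_args` is a hypothetical wrapper, while the variable names (`DO_INSTALL`, `POSITIONAL_ARGS`, `TARGET`, `BUILD_TYPE`, `PY_VER`) match the diff. The `PY_VER` default is hardcoded here instead of detected from `python3`:

```shell
#!/usr/bin/env bash
# Separate the --install flag from positional arguments, as build.sh now does.
parse_build_args() {
  DO_INSTALL=0
  POSITIONAL_ARGS=()
  local arg
  for arg in "$@"; do
    case "$arg" in
      --install) DO_INSTALL=1 ;;
      *) POSITIONAL_ARGS+=("$arg") ;;
    esac
  done
  # Positional defaults mirror the script: target, build type, python version.
  TARGET=${POSITIONAL_ARGS[0]:-cuda}
  BUILD_TYPE=${POSITIONAL_ARGS[1]:-all}
  PY_VER=${POSITIONAL_ARGS[2]:-3.11}   # real script detects this via python3
}

parse_build_args cuda p2p 3.11 --install
echo "target=$TARGET build=$BUILD_TYPE py=$PY_VER install=$DO_INSTALL"
# → target=cuda build=p2p py=3.11 install=1
```

Because the flag is filtered out before positionals are assigned, `--install` may appear anywhere on the command line.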
60 changes: 0 additions & 60 deletions build_and_install.sh

This file was deleted.

2 changes: 1 addition & 1 deletion collective/efa/README.md
@@ -44,7 +44,7 @@ make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=$UCC

The easiest way is to use docker, which packs all needed external libraries into a Python wheel and installs it into your local Python env:
```bash
cd $UCCL_HOME && bash build_and_install.sh cuda efa
cd $UCCL_HOME && bash build.sh cuda efa --install
```

The following alternative is best for development where you have installed all needed external libraries:
4 changes: 2 additions & 2 deletions collective/rdma/README.md
@@ -58,7 +58,7 @@ make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda

The easiest way is to use docker, which packs all needed external libraries into a Python wheel and installs it into your local Python env:
```bash
cd $UCCL_HOME && bash build_and_install.sh cuda rdma
cd $UCCL_HOME && bash build.sh cuda rdma --install
```

The following alternative is best for development where you have installed all needed external libraries:
@@ -107,7 +107,7 @@ make MPI=1 MPI_HOME=/opt/ohpc/pub/mpi/openmpi4-gnu12/4.1.5 HIP_HOME=/opt/rocm-6.

The easiest way is to use docker:
```bash
cd $UCCL_HOME && bash build_and_install.sh rocm rdma
cd $UCCL_HOME && bash build.sh rocm rdma --install
```

The following alternative is best for development where you have installed all needed external libraries: