m#50

Open
chenncy wants to merge 10 commits into InfiniTensor:main from chenncy:main

chenncy commented Mar 16, 2026

No description provided.

Copilot AI review requested due to automatic review settings March 16, 2026 16:08

Copilot AI left a comment

Pull request overview

This PR extends the build, runtime, and Python bindings to support NVIDIA CUDA execution (including device runtime + CUDA ops), adds optional NCCL-based tensor-parallel support, and introduces several diagnostic/benchmark scripts and tests while renaming the Python package to llaisys_py.

Changes:

  • Add CUDA/NVIDIA build targets, device runtime implementation, and GPU dispatch for multiple ops (plus CPU fallbacks where needed).
  • Add optional NCCL communication layer and Python bindings/tests for tensor-parallel inference (Project #5).
  • Rename/standardize Python package import path to llaisys_py, update tests, and add server/diagnostic tooling.
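As a rough illustration of what tensor-parallel inference involves, here is a minimal pure-Python sketch of a row-parallel linear layer. It is not the PR's implementation: `matvec`, `all_reduce_sum`, and `row_parallel_linear` are hypothetical names, and the sum all-reduce that NCCL would perform across GPUs is simulated with a plain list operation.

```python
def matvec(weight, x):
    """Compute weight @ x for a row-major list-of-lists matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def all_reduce_sum(partials):
    """Stand-in for a sum all-reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def row_parallel_linear(weight, x, world_size):
    """Shard the input dimension across ranks, then sum the partial outputs."""
    in_features = len(x)
    assert in_features % world_size == 0, "input size must divide evenly"
    shard = in_features // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w_shard = [row[lo:hi] for row in weight]  # this rank's weight columns
        partials.append(matvec(w_shard, x[lo:hi]))
    return all_reduce_sum(partials)


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 0, 2, 1]
reference = matvec(weight, x)
per_rank = row_parallel_linear(weight, x, world_size=2)
```

After the simulated all-reduce, every rank holds the same output as the single-device `reference`, which is the property a multi-process test harness would typically check.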

Reviewed changes

Copilot reviewed 88 out of 132 changed files in this pull request and generated 13 comments.

File Description
xmake/nvidia.lua Adds CUDA static targets (device + ops) with devlink and gencodes.
xmake/cpu.lua Adjusts CPU targets’ warnings/flags handling to rely on global -fPIC.
xmake.lua Global build policy/flags, optional NCCL target, OpenMP/AVX flags, updated install copy logic.
test/test_tensor.py Updates tests to import llaisys_py.
test/test_tensor_parallel.py Adds tensor-parallel (NCCL) multi-process test harness.
test/test_runtime.py Updates tests to import llaisys_py.
test/test_multi_user_chat.py Adds concurrent HTTP chat request test script.
test/test_kv_cache.py Adds KV cache export/import + suffix prefill test script.
test/test_batch_correctness.py Adds sequential vs Engine batched-output correctness script.
test/ops/swiglu.py Updates op test to import llaisys_py.
test/ops/self_attention.py Updates op test to import llaisys_py.
test/ops/sample.py Adds sampling op test script (CPU-oriented).
test/ops/rope.py Updates op test to import llaisys_py.
test/ops/rms_norm.py Updates op test to import llaisys_py.
test/ops/linear.py Updates op test to import llaisys_py.
test/ops/linear_bench.py Adds reproducible linear benchmark runner.
test/ops/linear_bench_report.py Adds JSON benchmark comparison/report script.
test/ops/embedding.py Updates op test to import llaisys_py.
test/ops/argmax.py Updates op test to import llaisys_py.
test/ops/add.py Updates op test to import llaisys_py.
test/minimal_engine_test.py Adds minimal tokenizer+engine reproduction script.
test/diagnose_gpu_layer.py Adds layer-by-layer GPU correctness diagnostic script.
src/utils/types.cpp Formatting-only change (no behavioral change).
src/utils/check.hpp Formatting-only change (macros unchanged).
src/utils.hpp Formatting-only change.
src/tensor/tensor.hpp Formatting-only change.
src/ops/swiglu/op.hpp Formatting-only change.
src/ops/swiglu/op.cpp Implements SwiGLU (CPU + NVIDIA dispatch).
src/ops/self_attention/op.hpp Formatting-only change.
src/ops/sample/op.hpp Adds sampling op API declaration + doc.
src/ops/rope/op.hpp Formatting-only change.
src/ops/rope/op.cpp Implements RoPE (CPU + NVIDIA dispatch).
src/ops/rms_norm/op.hpp Formatting-only change.
src/ops/rms_norm/op.cpp Implements RMSNorm (CPU + NVIDIA dispatch).
src/ops/rearrange/op.hpp Formatting-only change.
src/ops/rearrange/op.cpp Formatting-only change (still TODO).
src/ops/linear/op.hpp Formatting-only change.
src/ops/embedding/op.hpp Formatting-only change.
src/ops/embedding/op.cpp Implements embedding (CPU + NVIDIA dispatch).
src/ops/argmax/op.hpp Formatting-only change.
src/ops/argmax/op.cpp Implements argmax (CPU + NVIDIA dispatch).
src/ops/add/op.hpp Adds docs to Add op interface.
src/ops/add/op.cpp Adds NVIDIA dispatch and context switching for Add.
src/ops/add/cpu/add_cpu.hpp Adds docs to CPU Add implementation interface.
src/ops/add/cpu/add_cpu.cpp Reworks CPU Add implementation and adds extensive inline commentary.
src/llaisys/tensor.cc Switches to LLAISYS_EXTERN_C wrapper macro.
src/llaisys/runtime.cc Switches to LLAISYS_EXTERN_C wrapper macro.
src/llaisys/ops.cc Adds sample op export; allows null bias for linear; switches extern wrapper macro.
src/llaisys/nccl_comm.cu Adds NCCL implementation (guarded by ENABLE_NCCL + ENABLE_NVIDIA_API).
src/llaisys/nccl_comm_stub.cc Adds NCCL stub symbols when NCCL isn’t enabled.
src/llaisys/llaisys_tensor.hpp Switches to LLAISYS_EXTERN_C wrapper macro.
src/device/runtime_api.hpp Formatting-only change.
src/device/runtime_api.cpp Formatting-only change.
src/device/nvidia/nvidia_runtime_api.cu Implements CUDA runtime API backend (device/mem/stream/memcpy).
src/device/nvidia/nvidia_resource.cuh Formatting-only change.
src/device/nvidia/nvidia_resource.cu Formatting-only change.
src/device/device_resource.hpp Formatting-only change.
src/device/cpu/cpu_runtime_api.cpp Formatting-only change.
src/device/cpu/cpu_resource.hpp Formatting-only change.
src/device/cpu/cpu_resource.cpp Formatting-only change.
src/core/storage/storage.hpp Formatting-only change.
src/core/storage/storage.cpp Formatting-only change.
src/core/runtime/runtime.hpp Adds shutdown-deactivation flag/API.
src/core/runtime/runtime.cpp Implements shutdown-deactivation logic in Runtime lifecycle.
src/core/llaisys_core.hpp Formatting-only change.
src/core/core.hpp Formatting-only change.
src/core/context/context.hpp Formatting-only change.
src/core/context/context.cpp Introduces global runtime pool to share CUDA context across threads.
src/core/allocator/naive_allocator.hpp Formatting-only change.
src/core/allocator/naive_allocator.cpp Formatting-only change.
src/core/allocator/allocator.hpp Formatting-only change.
scripts/run_server.sh Adds helper script to run server with PYTHONPATH.
scripts/list_safetensors_keys.py Adds safetensors metadata inspection script.
scripts/download_model.py Adds Hugging Face model download helper.
python/setup.cfg Renames package to llaisys-py, updates package data section.
python/pyproject.toml Formatting-only change.
python/llaisys/models/qwen2.py Removes old (stub) llaisys package model code.
python/llaisys_py/tensor.py Formatting-only change.
python/llaisys_py/server/README.md Adds server usage docs.
python/llaisys_py/server/chat_cli.py Adds CLI client for the server.
python/llaisys_py/server/main.py Adds server entrypoint with model loading + uvicorn run.
python/llaisys_py/server/__init__.py Exposes create_app.
python/llaisys_py/runtime.py Formatting-only change.
python/llaisys_py/ops.py Adds sampling binding; linear bias optional.
python/llaisys_py/models/__init__.py Formatting-only change.
python/llaisys_py/libllaisys/tensor.py Formatting-only change.
python/llaisys_py/libllaisys/runtime.py Formatting-only change.
python/llaisys_py/libllaisys/qwen2.py Adds ctypes bindings for expanded Qwen2 C API.
python/llaisys_py/libllaisys/ops.py Adds ctypes signature for llaisysSample.
python/llaisys_py/libllaisys/nccl_comm.py Adds ctypes bindings for NCCL comm API.
python/llaisys_py/libllaisys/llaisys_types.py Formatting-only change.
python/llaisys_py/libllaisys/__init__.py Preloads OpenMP runtime on Linux; loads qwen2 + NCCL bindings.
python/llaisys_py/__init__.py Normalizes CUDA_VISIBLE_DEVICES at import time; re-exports API.
LICENSE Formatting-only change.
include/llaisys/tensor.h Switches extern wrapper macro usage.
include/llaisys/runtime.h Switches extern wrapper macro usage.
include/llaisys/ops.h Adds llaisysSample C API; switches extern wrapper macro usage.
include/llaisys/ops_nvidia.h Adds CUDA op declaration header for NVIDIA dispatch.
include/llaisys/nccl_comm.h Adds NCCL comm C API header.
include/llaisys/models/qwen2.h Expands Qwen2 C API for batching/TP/KV cache; switches extern wrapper macro usage.
include/llaisys.h Renames __C macro to LLAISYS_EXTERN_C.
docs/install-xmake.md Adds Xmake install instructions for Linux servers.
=42 Adds a pip install log file (likely accidental).
.gitignore Adds model dir ignore; now ignores entire docs/.
.github/workflows/build.yaml Formatting-only change.
.clang-format Formatting-only change.

=42
Comment on lines +1 to +11
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Collecting setuptools
Downloading https://mirrors.aliyun.com/pypi/packages/e1/c6/76dc613121b793286a3f91621d7b75a2b493e0390ddca50f11993eadf192/setuptools-82.0.0-py3-none-any.whl (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 24.9 MB/s eta 0:00:00
Collecting wheel
Downloading https://mirrors.aliyun.com/pypi/packages/87/22/b76d483683216dde3d67cba61fb2444be8d5be289bf628c13fc0fd90e5f9/wheel-0.46.3-py3-none-any.whl (30 kB)
Collecting packaging>=24.0 (from wheel)
Downloading https://mirrors.aliyun.com/pypi/packages/b7/b9/c538f279a4e237a006a2c98387d081e9eb060d203d8ed34467cc0f0b9b53/packaging-26.0-py3-none-any.whl (74 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.4/74.4 kB 23.6 MB/s eta 0:00:00
Installing collected packages: setuptools, packaging, wheel
Successfully installed packaging-26.0 setuptools-82.0.0 wheel-0.46.3
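The file name `=42` and the log contents are consistent with a shell command such as `pip install setuptools>=42 wheel` run without quotes, where `>` is treated as output redirection. That command is an assumption; the PR does not show it. The sketch below uses Python's `shlex` in shell-punctuation mode to show how the unquoted and quoted forms tokenize differently (`shell_like_tokens` is a hypothetical helper).

```python
import shlex


def shell_like_tokens(command: str) -> list[str]:
    """Tokenize roughly as a POSIX shell would, splitting out operators like '>'."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    return list(lexer)


# Assumed command: unquoted, so '>' becomes a redirection to a file named '=42'.
unquoted = shell_like_tokens("pip install setuptools>=42 wheel")

# Quoted: the version specifier stays a single argument for pip.
quoted = shell_like_tokens('pip install "setuptools>=42" wheel')

print(unquoted)
print(quoted)
```

Quoting the requirement keeps the specifier intact and avoids creating the stray file.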
Comment on lines +13 to +19
#define EXCEPTION_UNSUPPORTED_DATATYPE(DT__) \
    do { \
        std::cerr << "[ERROR] Unsupported data type: " \
                  << llaisys::utils::dtype_to_str(DT__) \
                  << EXCEPTION_LOCATION_MSG << std::endl; \
        throw std::runtime_error("Unsupported device"); \
    } while (0)
Comment on lines 14 to 17
 Runtime::~Runtime() {
-    if (!_is_active) {
+    if (!_is_active && !_deactivated_for_shutdown) {
         std::cerr << "Mallicious destruction of inactive runtime." << std::endl;
     }
Comment on lines +15 to +22
static ncclDataType_t to_nccl_dtype(llaisysDataType_t dtype) {
    switch (dtype) {
    case LLAISYS_DTYPE_F32: return ncclFloat32;
    case LLAISYS_DTYPE_F16:
    case LLAISYS_DTYPE_BF16: return ncclFloat16;
    case LLAISYS_DTYPE_I64: return ncclInt64;
    default: return ncclFloat32;
    }
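The mapping above sends BF16 to `ncclFloat16`, so a sum reduction would presumably read bfloat16 bit patterns as IEEE half-precision values. The sketch below, using only the standard `struct` module, simulates that misinterpretation for a two-rank sum of 1.0 + 1.0. The helper names are hypothetical, and the bfloat16 conversion truncates rather than rounds, which is exact for the values used here. NCCL does define `ncclBfloat16` in builds with bfloat16 support, which would be the matching type.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """bfloat16 bit pattern: the top 16 bits of the float32 encoding (truncation)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def bf16_bits_to_f32(bits: int) -> float:
    """Decode a bfloat16 bit pattern by padding the low 16 bits with zeros."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def f16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE half precision."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def float_to_f16_bits(x: float) -> int:
    """Encode a value as IEEE half precision and return its bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def sum_bf16_as_f16(a: float, b: float) -> float:
    """Sum two bfloat16 values the way a float16 reduction would see them."""
    total = f16_bits_to_float(f32_to_bf16_bits(a)) + f16_bits_to_float(f32_to_bf16_bits(b))
    return bf16_bits_to_f32(float_to_f16_bits(total))


misread_one = f16_bits_to_float(f32_to_bf16_bits(1.0))
wrong_sum = sum_bf16_as_f16(1.0, 1.0)
right_sum = bf16_bits_to_f32(f32_to_bf16_bits(1.0 + 1.0))
print(misread_one, wrong_sum, right_sum)
```

In this simulation the bfloat16 encoding of 1.0 is read as 1.875 in half precision, and the reduced result decodes back to 256.0 instead of 2.0. The `default` branch raises a related concern: an unsupported dtype would silently be reduced as float32 rather than reported.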
Comment on lines 12 to 20
 #ifdef __cplusplus
-#define __C extern "C"
+#define LLAISYS_EXTERN_C extern "C"
 #include <cstddef>
 #include <cstdint>
 #else
-#define __C
+#define LLAISYS_EXTERN_C
 #include <stddef.h>
 #include <stdint.h>
 #endif
Comment on lines +17 to +19
for i in range(ndev):
    print("Testing device {i}...")
    api.set_device(i)
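The `print` call in this snippet uses a plain string, so `{i}` is printed literally rather than replaced by the loop index. A minimal sketch of the difference, with `progress_messages` as a hypothetical helper:

```python
def progress_messages(ndev: int) -> tuple[list[str], list[str]]:
    """Build the messages with and without the f-string prefix."""
    literal = ["Testing device {i}..." for i in range(ndev)]     # braces kept as-is
    formatted = [f"Testing device {i}..." for i in range(ndev)]  # index interpolated
    return literal, formatted


literal, formatted = progress_messages(2)
print(literal)
print(formatted)
```

Adding the `f` prefix is enough to make the message report the device being tested.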
Comment on lines 4 to 7
 #include "../llaisys.h"

-__C {
+LLAISYS_EXTERN_C {
     // Runtime API Functions
Comment on lines 4 to 7
 #include "../llaisys.h"

-__C {
+LLAISYS_EXTERN_C {
     typedef struct LlaisysTensor *llaisysTensor_t;
Comment on lines 4 to 7
 #include "tensor.h"

-__C {
+LLAISYS_EXTERN_C {
     __export void llaisysAdd(llaisysTensor_t c, llaisysTensor_t a, llaisysTensor_t b);
Comment on lines 4 to 7
 #include "../tensor.h"

-__C {
+LLAISYS_EXTERN_C {
     struct LlaisysQwen2Meta {
2 participants