Callable identity update: public hierarchical registration returns
CallableHandle; local IPC task frames carry the handle hash digest and each
target resolves it to a private local slot. See
callable-identity-registration.md.
This document covers:
- The L0–L6 level model (what each level represents)
- The three engine components (Orchestrator / Scheduler / Worker) and their division of responsibility
- How components compose recursively from L3 upward
For details of each component's internals, see:
- orchestrator.md — submit flow, TensorMap, Scope, Ring, task state machine
- scheduler.md — dispatch loop, queues, completion handling
- worker-manager.md — WorkerThread pool, fork + mailbox
- task-flow.md — Callable / TaskArgs / CallConfig data flow, execution leaves
- remote-l3-worker-design.md — design proposal for scheduling remote L3 workers as NEXT_LEVEL children
For the L2 chip-level details (host .so, AICPU, AICore), see
chip-level-arch.md.
The runtime uses a 7-level hierarchy mirroring the physical topology of Ascend NPU clusters:
L6 CLOS2 / Cluster ── full cluster (N6 super-nodes)
L5 CLOS1 / SuperNode ── super-node (N5 pods)
L4 POD / Pod ── pod (4 hosts)
L3 HOST / Node ── single host machine (16 chips + M SubWorkers)
L2 CHIP / Processor ── one NPU chip (shared device memory)
L1 DIE / L2Cache ── chip die (hardware-managed)
L0 CORE / AIV, AIC ── individual compute core (hardware-managed)
L2 is the boundary between two worlds:
- L0–L2 (on-device): AICPU scheduler, AICore/AIV workers, device Global Memory. Managed by the chip-level runtime (see chip-level-arch.md). Communication via shared GM with atomics and barriers.
- L3–L6 (host/cluster): each level runs the same scheduling engine composed of Orchestrator + Scheduler + Worker pool. Communication via IPC (fork + shm for today's L3 and for local recursive L4+ composition). Cross-host L4/L5/L6 composition, where a parent schedules a remote L3 endpoint over RoCE/HCCS/UB/sockets, now has the local endpoint/eligibility boundary and socket-backed simulation session runner implemented; HCOMM hardware profiles are still pending.
| Level | Workers it contains | Status |
|---|---|---|
| L3 (Host) | ChipWorker ×N + SubWorker ×M |
Implemented |
| L4 (Pod) | Worker(level=3) ×N + SubWorker ×M |
Local implemented; remote sim implemented; HCOMM profiles pending |
| L5 (SuperNode) | Worker(level=4) ×N |
Local L4 code path, untested; remote proposed |
| L6 (Cluster) | Worker(level=5) ×N |
Local L4 code path, untested; remote proposed |
Worker is a single C++ class that handles every level from L3 upward — the
level parameter is a diagnostic label; behavior does not branch on it. The
same Orchestrator/Scheduler/Worker code runs unchanged.
Every level L3+ runs three cooperating components. Each has its own dedicated thread in the parent process.
The DAG builder. Exposed to the user's orchestration function as the
first argument of submit_*. Runs single-threaded on the user's thread.
Owns:
Ring— fixed-size slot pool, allocates with back-pressureTensorMap— tensor dependency key to producer slot lookup, drives automatic dep inference. Local keys contain a pointer; remote keys contain buffer identity and logical offset.Scope— lifetime management for intermediate tensors
One submit_next_level(callable, task_args, config) call:
- allocates a slot
- moves task data into the slot
- walks
TaskArgstags (INPUT/OUTPUT/INOUT/OUTPUT_EXISTING/NO_DEP) to lookup/insert TensorMap entries - records fanin metadata on producer slots
- pushes the new slot onto the scheduler's wiring queue
See orchestrator.md for the 7-step submit flow and state machine.
The DAG executor. A dedicated C++ thread that drains three queues:
- wiring queue — slots just submitted; wire fanout edges, compute readiness
- ready queue — slots with all fanin satisfied; pick an idle WorkerThread and dispatch
- completion queue — slots whose worker finished; release fanout, wake downstream consumers, retire slot
The Scheduler never inspects task data — it just moves slot ids between queues and consults TaskSlotState metadata.
See scheduler.md for the dispatch loop and coordination.
The execution layer. WorkerManager holds two pools of WorkerThreads
(next-level pool and sub pool). Each WorkerThread owns one std::thread that
encodes (callable, config, args_blob) into a MAILBOX_SIZE-byte shared
memory region, signals the pre-forked Python child, and spin-polls
TASK_DONE, returning an explicit completion outcome to the Scheduler.
- Next-level (chip) children run
_chip_process_loop, which constructs aChipWorkerand dispatches each kernel through it. - SUB children run
_sub_worker_loop, which decodes the args blob into aTaskArgsand calls the registered Python callable asfn(args). There is no C++SubWorkerclass — SUB workers exist only as a worker-type enum value plus a Python child loop.
See worker-manager.md for the dispatch state machine,
fork ordering, and mailbox layout. See task-flow.md for
what flows through ChipWorker::run.
Orch thread Scheduler thread Worker threads
─────────── ──────────────── ──────────────
User code ──► Orchestrator Scheduler
│ │
│ submit(callable, args, config) │
│ 1. ring.alloc() │
│ 2. TensorMap lookup/insert │
│ 3. record fanin │
│ 4. push wiring_queue ───────►│
│ │ Phase 0: drain wiring_queue
│ │ wire fanout edges
│ │ if ready → ready_queue
│ │ pop ready_queue
│ │ pick idle WorkerThread
│ │ wt.dispatch(slot_id) ──────► WorkerThread
│ │ encode mailbox → spin-poll TASK_DONE
│ │ (blocking; child runs the kernel)
│ │◄── completion_queue ────── on_complete_(completion)
│ │ on_task_complete:
│ │ success → COMPLETED
│ │ failure → FAILED + poison downstream
│ │ try_consume → ring release
│ drain() ◄── notify when all done │
Communication channels:
| Path | Mechanism | Payload |
|---|---|---|
| Orch → Scheduler | wiring_queue (mutex + CV) | slot id |
| Scheduler → WorkerThread | WorkerThread internal queue | slot id |
| WorkerThread → Scheduler | completion_queue (mutex + CV) | slot id + group index + outcome |
| WorkerThread ↔ child | shm mailbox (state + error + task data) | encoded blob |
| Python ↔ C++ | nanobind bindings | TaskArgs / CallConfig / callable handle |
| Tensor data | torch.share_memory_() or host malloc |
zero-copy shared address |
A higher-level Worker can register a lower-level Worker as a
NEXT_LEVEL child through the same mailbox protocol L3 uses for chip
children. The Python Worker.add_worker(child) stores an un-init'd child
Worker; on first run(), the parent forks a child process that inits the
inner Worker and enters a mailbox-polling loop (_child_worker_loop).
# L3 child: sub-only (or with chips via device_ids)
l3 = Worker(level=3, num_sub_workers=1)
l3_sub_handle = l3.register(lambda: verify_result())
def my_l3_orch(orch, args, config):
orch.submit_sub(l3_sub_handle)
# L4 parent
w4 = Worker(level=4, num_sub_workers=0)
l3_handle = w4.register(my_l3_orch)
l3_worker_id = w4.add_worker(l3)
w4.init()
def my_l4_orch(orch, args, config):
orch.submit_next_level(l3_handle, TaskArgs(), CallConfig(), worker=l3_worker_id)
w4.run(my_l4_orch)
w4.close()l3_worker_id is the local child worker id returned by add_worker(...).
It is a public worker id, not necessarily the child's C++ worker-thread
vector index.
When L4's WorkerThread writes a task frame to the L3 child's mailbox, the
frame carries the callable hash digest plus config and args_blob. The child
loop reads the digest, resolves it through its local identity table to a private
orch-function slot, and calls inner_worker.run(orch_fn, args, cfg). The inner
Worker opens its own scope, executes the orch function with its own
Orchestrator, and drains. Each level's orch fn receives its own Orchestrator
— recursion is symmetric.
Nested fork ordering: L3's own children (sub/chip) are forked inside
the L4 child process, on L3's first run(). This keeps the process tree
clean: L4 parent → L3 child → L3's sub/chip grandchildren.
Mode per level is independent: L3 can use PROCESS (chip children), while
L4 also uses PROCESS (L3 Worker children). Each Worker picks its children's
mode independently. Nested forks are safe because L3 init happens inside the
already-forked L3 child process.
See task-flow.md §9 for the full recursive data-flow walk-through.
| Concern | Python layer | C++ layer |
|---|---|---|
| Process lifecycle | fork() timing, SharedMemory alloc/unlink, waitpid |
— |
| Callable registration | owns handle/hashid registries and child-local Python dispatch mappings | — |
| Orchestration DAG | user's orch fn, submit_* calls |
Orchestrator::submit_* engine |
| Scheduling | — | Scheduler thread, queues, WorkerThread pool |
| Dispatch | — | WorkerThread::dispatch → WorkerEndpoint::run, mailbox IPC for local endpoints |
| Runtime execution | — | ChipWorker via dlsym'd runtime .so |
Python handles when things happen (fork ordering, lifecycle). C++ handles how fast (threading, atomics, zero-copy dispatch).
┌──────────────────────────────────────────────────────────────┐
│ Parent (main) process │
│ │
│ Python main thread (Orch) │
│ │ │
│ ├── C++ Scheduler thread │
│ ├── C++ WorkerThread[0] ── shm mailbox ──► chip child 0 │
│ ├── C++ WorkerThread[1] ── shm mailbox ──► chip child 1 │
│ ├── C++ WorkerThread[2] ── shm mailbox ──► sub child 0 │
│ └── C++ WorkerThread[3] ── shm mailbox ──► sub child 1 │
│ │
└─────────────────────────────┬────────────────────────────────┘
│ fork() (before any C++ thread starts)
┌─────────────────┼─────────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Chip child 0 │ │ Chip child 1 │
│ poll mailbox │ … │ poll mailbox │
│ ChipWorker.run │ │ ChipWorker.run │
└─────────────────┘ └─────────────────┘
Fork ordering invariant: Python forks every child process FIRST, before
any C++ Scheduler / WorkerThread is started. This avoids the classical
fork-in-a-multi-threaded-process hazard.
A single device can only run one runtime per CANN process context. CANN's
AICPU framework (libaicpu_extend_kernels.so) caches the user AICPU .so on
first load and skips reloading on subsequent launches. If a different
runtime's AICPU .so is launched on the same device, the cached (stale)
function pointers are used, causing hangs.
Do not reuse a device across different runtimes within a single process. Use separate processes (one per runtime), or partition devices so each runtime gets exclusive devices. See testing.md for the pytest device allocation algorithm.
| Path | Role |
|---|---|
src/common/hierarchical/orchestrator.{h,cpp} |
Orchestrator: submit, TensorMap, Scope |
src/common/hierarchical/scheduler.{h,cpp} |
Scheduler: dispatch loop + queues |
src/common/hierarchical/worker_manager.{h,cpp} |
WorkerManager + WorkerThread: pool, mailbox-IPC dispatch |
src/common/hierarchical/remote_endpoint.{h,cpp} |
RemoteL3Endpoint transport-neutral TASK/COMPLETION boundary |
src/common/hierarchical/remote_wire.{h,cpp} |
Versioned remote L3 frame codec |
src/common/hierarchical/worker.{h,cpp} |
Worker (L3+): composes the above |
src/common/hierarchical/ring.{h,cpp} |
slot allocator |
src/common/hierarchical/tensormap.{h,cpp} |
base_ptr → producer slot |
src/common/hierarchical/scope.{h,cpp} |
scope lifetime management |
src/common/worker/chip_worker.{h,cpp} |
L2 ChipWorker (kernel-running leaf, runs in the forked chip child) |
python/bindings/ |
nanobind exposure of C++ engine to Python |
python/simpler/worker.py |
Python Worker factory + lifecycle wrapper |