Chip-Level Architecture (L2)

This document describes the single-chip (L2) architecture — how a host program, AICPU kernel, and AICore kernel cooperate on one Ascend NPU chip. For the multi-chip hierarchy (L3+: Orchestrator / Scheduler / Worker composition) see hierarchical_level_runtime.md. For how task data (Callable / TaskArgs / CallConfig) flows through all levels, see task-flow.md.

Three-Program Model

The PTO Runtime consists of three separate programs that communicate through well-defined APIs:

┌─────────────────────────────────────────────────────────────┐
│                    Python Application                        │
│   (pytest @scene_test classes, or `python test_*.py`)        │
└─────────────────────────┬───────────────────────────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    nanobind          ChipWorker       RuntimeBuilder
    (task_interface)  (dlopen host.so)  (compile binaries)
         │                │                │
         ▼                ▼                ▼
┌──────────────────┐  ┌──────────────────┐
│   Host Runtime   │  │   Binary Data    │
│ (src/{arch}/     │  │  (AICPU + AICore)│
│  platform/)      │  └──────────────────┘
├──────────────────┤         │
│ DeviceRunner     │         │
│ Runtime          │    Loaded at runtime
│ MemoryAllocator  │         │
│ C API            │         │
└────────┬─────────┘         │
         │                   │
         └───────────────────┘
                 │
                 ▼
    ┌────────────────────────────┐
    │  Ascend Device (Hardware)   │
    ├────────────────────────────┤
    │ AICPU: Task Scheduler       │
    │ AICore: Compute Kernels     │
    └────────────────────────────┘

Components

1. Host Runtime (`src/{arch}/platform/*/host/`)

C++ library - Device orchestration and management

DeviceRunner: Handle-based device context manager (one per ChipWorker)
MemoryAllocator: Device tensor memory management
pto_runtime_c_api.h: Pure C API for ChipWorker bindings (src/common/worker/pto_runtime_c_api.h)
Compiled to shared library (.so) at runtime

Key Responsibilities:

Allocate/free device memory
Host <-> Device data transfer
AICPU kernel launching and configuration
AICore kernel registration and loading
Runtime execution workflow coordination

2. AICPU Kernel (`src/{arch}/platform/*/aicpu/`)

Device program - Task scheduler running on AICPU processor

kernel.cpp: Kernel entry points and handshake protocol
Runtime-specific executor in src/{arch}/runtime/*/aicpu/
Compiled to device binary at build time

Key Responsibilities:

Initialize handshake protocol with AICore cores
Wire fanout dependency edges from orchestrator's wiring queue (scheduler thread 0)
Identify ready tasks (fanin satisfied) and enqueue to ready queues
Dispatch ready tasks to idle AICore cores
Track task completion and notify downstream consumers
Continue until all tasks complete

3. AICore Kernel (`src/{arch}/platform/*/aicore/`)

Device program - Computation kernels executing on AICore processors

kernel.cpp: Task execution kernels (add, mul, etc.)
Runtime-specific executor in src/{arch}/runtime/*/aicore/
Compiled to object file (.o) at build time

Key Responsibilities:

Wait for task assignment via handshake buffer
Read task arguments and kernel address
Execute kernel using PTO ISA
Signal task completion
Poll for next task or quit signal

API Layers

Layer 1: C++ API (`src/{arch}/platform/*/host/device_runner.h`)

DeviceRunner runner;
void *ptr = runner.allocate_tensor(bytes);
runner.copy_to_device(dev_ptr, host_ptr, bytes);
runner.set_executors(aicpu_binary, aicore_binary);   // once, at init time
runner.run(runtime, config);                         // config carries block_dim, aicpu_thread_num, diagnostics
runner.finalize();

Layer 2: C API (`src/common/worker/pto_runtime_c_api.h`)

// libsimpler_log.so (RTLD_GLOBAL, loaded first by the Python wrapper):
simpler_log_init(log_level, log_info_v);              // seed HostLogger once

// host_runtime.so (RTLD_LOCAL, loaded after):
DeviceContextHandle ctx = create_device_context();
simpler_init(ctx, device_id,                          // attach + binary takeover
             aicpu_binary, aicpu_size,
             aicore_binary, aicore_size);
size_t size = get_runtime_size();
register_callable(ctx, cid, callable);                 // one-time per callable
simpler_run(ctx, runtime, cid, args, config);        // per-launch — no binaries; config
                                                       // carries block_dim, aicpu_thread_num,
                                                       // diagnostics + ring overrides
unregister_callable(ctx, cid);
finalize_device(ctx);
destroy_device_context(ctx);

Layer 3: Python API (`python/bindings/task_interface.cpp` via nanobind)

from simpler.task_interface import ChipWorker, ChipCallable, ChipStorageTaskArgs, CallConfig

worker = ChipWorker()
worker.init(device_id=0, bins=bins)   # bins = RuntimeBuilder(platform).get_binaries(...)

config = CallConfig()
# config.block_dim defaults to 0 = auto (DeviceRunner resolves to the max
# the AICore stream allows). Set explicitly to pin a smaller value.
config.aicpu_thread_num = 3
config.enable_pmu = 0
worker.run(callable, args, config)
worker.finalize()

Python Type Naming Convention

Layer 3 Python types use a level-prefixed naming convention that mirrors the level model (see hierarchical_level_runtime.md):

Concept	L2 (Chip) type	L3+ (Distributed) type	Unified factory
Worker	`ChipWorker`	`Worker`	`Worker(level=N)`
Callable	`ChipCallable`	(planned)	—
TaskArgs	`ChipStorageTaskArgs`	(planned)	—
Config	`CallConfig`	`CallConfig`	—

CallConfig is the exception — same type used at every level, with no Chip* / unprefixed split (see task-flow.md for details). The unified Worker(level=N) factory already routes to the correct backend. When new level-specific types are added (e.g. ChipCallable), each concept should follow the same pattern: a Chip* concrete type for L2, a prefix-less concrete type for L3+, and optionally a factory function that routes by level.

Execution Flow

1. Python Setup Phase

Python test_*.py (SceneTestCase)
  │
  ├─→ RuntimeBuilder(platform).get_binaries(runtime_name) → host.so, aicpu.so, aicore.o
  ├─→ KernelCompiler(platform).compile_incore(source, core_type) → kernel .o/.so
  ├─→ KernelCompiler(platform).compile_orchestration(runtime, source) → orch .so
  │
  └─→ ChipWorker()
       └─→ init(device_id, bins)                          # Python wrapper
            ├─→ ctypes.CDLL(libsimpler_log.so, RTLD_GLOBAL)   # once per process
            ├─→ simpler_log_init(log_level, log_info_v) → HostLogger seeded
            ├─→ ctypes.CDLL(libcpu_sim_context.so, RTLD_GLOBAL)  # sim only, once
            └─→ _ChipWorker.init(host_path, aicpu_path, aicore_path, device_id)  # C++
                 ├─→ dlopen(host.so, RTLD_LOCAL) → resolve C API symbols via dlsym
                 ├─→ create_device_context() → DeviceContextHandle
                 └─→ simpler_init(ctx, device_id, aicpu*, aicpu_size, aicore*, aicore_size)
                      ├─→ (onboard) dlog_setlevel(HostLogger.level())   # before context open
                      ├─→ DeviceRunner::attach_current_thread(device_id)
                      │    ├─→ rtSetDevice(device_id) on onboard
                      │    └─→ pto_cpu_sim_bind+acquire on sim
                      └─→ DeviceRunner::set_executors(aicpu, aicore)

2. Initialization Phase

The thread that called init() is now attached to device_id. Streams are created lazily on the first run() call (prepare_run_context). Subsequent device-ops (malloc, copy_to, copy_from, free) reuse that per-thread binding — they must be called from the same thread that called init().

3. Execution Phase

worker.run(callable, args, CallConfig(block_dim, aicpu_thread_num))
  │
  └─→ run_runtime(ctx, runtime, callable, args, ...)
       │
       ├─→ Upload the entire ChipCallable buffer (upload_chip_callable_buffer)
       │      then fill func_id_to_addr_[fid] = chip_dev + storage_offset + child_offset(i)
       ├─→ Allocate device tensors via MemoryAllocator
       ├─→ Copy input data to device
       ├─→ Build task graph with dependencies
       │
       ├─→ Copy Runtime to device memory
       │
       ├─→ LaunchAiCpuKernel (main scheduler kernel)
       │    └─→ Execute on AICPU: Task scheduler loop
       │         ├─→ Find initially ready tasks
       │         ├─→ Loop: dispatch tasks, wait for completion
       │         └─→ Continue until all tasks done
       │
       ├─→ LaunchAicoreKernel
       │    └─→ Execute on AICore cores: Task workers
       │         ├─→ Wait for task assignment
       │         ├─→ Execute kernel
       │         └─→ Signal completion, repeat
       │
       ├─→ rtStreamSynchronize (wait for completion)
       │
       ├─→ Copy results from device to host
       └─→ Clean up device tensors and runtime

4. Finalization Phase

worker.finalize()
  │
  └─→ finalize_device(ctx)
       ├─→ Release device resources
       └─→ destroy_device_context(ctx)

Handshake Protocol

AICPU and AICore cores coordinate via handshake buffers (one per core):

struct Handshake {
    volatile uint32_t aicpu_ready;   // AICPU→AICore: scheduler ready
    volatile uint32_t aicore_done;   // AICore→AICPU: core ready
    volatile uint64_t task;          // AICPU→AICore: task pointer (init only; runtime uses DATA_MAIN_BASE)
};

Flow:

AICPU finds a ready task
AICPU writes task pointer to handshake buffer and signals via DATA_MAIN_BASE register
AICore polls DATA_MAIN_BASE, reads the task, executes
AICore writes FIN to COND; AICPU observes completion
AICPU reads result and continues

Platform Backends

Two backends under src/{arch}/platform/: onboard/ (real Ascend hardware) and sim/ (thread-based host simulation, no SDK required).

See per-arch platform docs: a2a3, a5.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Chip-Level Architecture (L2)

Three-Program Model

Components

1. Host Runtime (`src/{arch}/platform/*/host/`)

2. AICPU Kernel (`src/{arch}/platform/*/aicpu/`)

3. AICore Kernel (`src/{arch}/platform/*/aicore/`)

API Layers

Layer 1: C++ API (`src/{arch}/platform/*/host/device_runner.h`)

Layer 2: C API (`src/common/worker/pto_runtime_c_api.h`)

Layer 3: Python API (`python/bindings/task_interface.cpp` via nanobind)

Python Type Naming Convention

Execution Flow

1. Python Setup Phase

2. Initialization Phase

3. Execution Phase

4. Finalization Phase

Handshake Protocol

Platform Backends

Uh oh!

FilesExpand file tree

chip-level-arch.md

Latest commit

History

chip-level-arch.md

File metadata and controls

Chip-Level Architecture (L2)

Three-Program Model

Components

1. Host Runtime (src/{arch}/platform/*/host/)

2. AICPU Kernel (src/{arch}/platform/*/aicpu/)

3. AICore Kernel (src/{arch}/platform/*/aicore/)

API Layers

Layer 1: C++ API (src/{arch}/platform/*/host/device_runner.h)

Layer 2: C API (src/common/worker/pto_runtime_c_api.h)

Layer 3: Python API (python/bindings/task_interface.cpp via nanobind)

Python Type Naming Convention

Execution Flow

1. Python Setup Phase

2. Initialization Phase

3. Execution Phase

4. Finalization Phase

Handshake Protocol

Platform Backends

1. Host Runtime (`src/{arch}/platform/*/host/`)

2. AICPU Kernel (`src/{arch}/platform/*/aicpu/`)

3. AICore Kernel (`src/{arch}/platform/*/aicore/`)

Layer 1: C++ API (`src/{arch}/platform/*/host/device_runner.h`)

Layer 2: C API (`src/common/worker/pto_runtime_c_api.h`)

Layer 3: Python API (`python/bindings/task_interface.cpp` via nanobind)