[Feature] Replace L2 kernel runtime and old L3 generate path with unified L3 worker dispatch #26

Description

@ndleslx

Summary

Replace the current non-L3 compiled-kernel execution path, which uses Worker(level=2), with an L3-worker based dispatch path using Worker(level=3) and orch.submit_next_level(...).

As part of this migration, remove the old dedicated run_generate_l3() generation path. L3 execution should become the unified runtime mechanism for prefill/decode kernel dispatch instead of maintaining a separate one-shot L3 generation path.

Area

Executor or runtime

Motivation / Use Case

The current non-L3 path directly runs compiled chip callables through an L2 worker, while the repository also carries a separate one-shot L3 generation path. This creates two runtime models for generation:

  • the normal prefill/decode path based on L2 worker dispatch
  • the dedicated run_generate_l3() path with separate generated L3 artifacts and control flow

Newer Simpler/PyPTO runtime features, dependency tagging, child-memory handling, and future serving orchestration are centered around L3 worker execution. Moving the normal prefill/decode kernel path to L3 worker dispatch provides a migration bridge and removes the need to maintain a second L3-specific generation implementation.

This enables:

  • serving to keep its existing prefill/decode scheduler semantics
  • compiled @pl.jit prefill/decode kernels to run through the same hierarchical runtime model as future L3 serving
  • tensor dependencies to be expressed through TaskArgs and TensorArgType
  • one unified runtime path for offline generation and HTTP serving
  • removal of duplicated one-shot L3 generation code once the L3-worker dispatch path covers the same workflows

Related: #18

Implementation PR: #29

Blockers

The current implementation depends on PyPTO runtime/codegen fixes that are tracked as blockers:

Proposed API / Behavior

For non-L3 compiled kernels:

  • construct Worker(level=3, device_ids=[device_id], num_sub_workers=0) or use PyPTO's DistributedWorker wrapper for shared L3 dispatch
  • register chip callables before worker initialization
  • build TaskArgs with explicit tensor dependency tags
  • submit chip callables inside an orchestration callback using orch.submit_next_level(...)
  • keep KV cache device-resident across prefill and decode dispatches where supported by PyPTO runtime tensors
  • pre-share CPU tensors before L3 worker initialization so child processes can access host mappings
  • provide runtime wrapper methods for submit_next_level, orchestrator-scoped memory operations, and cleanup
  • remove the old one-shot run_generate_l3() path and its dedicated generated-L3 artifacts from the serving runner once the L3-worker dispatch path covers offline and HTTP generation
  • keep one runtime path for offline generation and serving, with feature flags only for rollout/debugging rather than permanent separate implementations
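The ordering constraints in the list above (register before initialization, tag every tensor, submit from inside an orchestration callback) can be sketched with local stand-in classes. This is not PyPTO code: `SketchWorker`, `SketchOrchestrator`, `SketchTaskArgs`, `SketchTensorTag`, and the `register`/`init`/`run` method names are all hypothetical, and the tensor names are illustrative. Only the `level=3, device_ids=[device_id], num_sub_workers=0` arguments and the `submit_next_level` name come from this issue.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class SketchTensorTag(Enum):
    """Hypothetical stand-in for TensorArgType; real members may differ."""
    INPUT = auto()
    OUTPUT = auto()
    INOUT = auto()


@dataclass
class SketchTaskArgs:
    """Hypothetical stand-in for TaskArgs: ordered (tensor name, tag) pairs."""
    tensors: list = field(default_factory=list)

    def add(self, name, tag):
        self.tensors.append((name, tag))
        return self


class SketchOrchestrator:
    """Records submissions in order instead of dispatching to a device."""

    def __init__(self, callables, log):
        self._callables = callables
        self._log = log

    def submit_next_level(self, name, args):
        if name not in self._callables:
            raise KeyError(f"callable {name!r} was not registered")
        self._log.append((name, [(t, tag.name) for t, tag in args.tensors]))


class SketchWorker:
    """Hypothetical stand-in for an L3 worker; enforces register-before-init."""

    def __init__(self, level, device_ids, num_sub_workers=0):
        self.level = level
        self.device_ids = device_ids
        self.num_sub_workers = num_sub_workers
        self._callables = {}
        self._initialized = False
        self.log = []

    def register(self, name, fn):
        if self._initialized:
            raise RuntimeError("register chip callables before init()")
        self._callables[name] = fn

    def init(self):
        self._initialized = True

    def run(self, orchestration_fn):
        if not self._initialized:
            raise RuntimeError("init() must be called before run()")
        orchestration_fn(SketchOrchestrator(self._callables, self.log))


def run_demo(decode_steps=2, device_id=0):
    worker = SketchWorker(level=3, device_ids=[device_id], num_sub_workers=0)
    # Placeholders for the compiled prefill/decode chip callables.
    worker.register("prefill", lambda: None)
    worker.register("decode", lambda: None)
    worker.init()

    def orchestrate(orch):
        # The KV cache is tagged INOUT so it carries over between dispatches.
        orch.submit_next_level(
            "prefill",
            SketchTaskArgs()
            .add("input_ids", SketchTensorTag.INPUT)
            .add("kv_cache", SketchTensorTag.INOUT)
            .add("logits", SketchTensorTag.OUTPUT),
        )
        for _ in range(decode_steps):
            orch.submit_next_level(
                "decode",
                SketchTaskArgs()
                .add("last_token", SketchTensorTag.INPUT)
                .add("kv_cache", SketchTensorTag.INOUT)
                .add("logits", SketchTensorTag.OUTPUT),
            )

    worker.run(orchestrate)
    return worker.log
```

Calling `run_demo(decode_steps=2)` returns the recorded submissions in order: one prefill followed by two decode steps, each listing its tagged tensors. The real runtime would derive task dependencies from those tags rather than from submission order.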

The first implementation needed conservative worker-lifecycle validation while Simpler behavior was being debugged. The Simpler-side lifecycle issue was tracked in hw-native-sys/simpler#980 and has been closed.

Alternatives Considered

Keep the current Worker(level=2) runtime for non-L3 kernels and only use L3 for the dedicated run_generate_l3() path.

That avoids L3 lifecycle complexity in the short term, but it keeps two separate runtime models in serving, duplicates generation control flow, and makes it harder to migrate scheduled prefill/decode execution into L3 DAG form.

Another alternative is to keep run_generate_l3() as a permanent fast path while adding L3-worker dispatch for serving. That would still leave long-term maintenance cost: separate artifacts, separate sampling/prepare logic, separate KV handling, and separate correctness/performance validation.

Additional Context

The current implementation is PR #29:

PR #29 rewrites non-L3 Qwen3 prefill/decode dispatch through a shared PyPTO L3 DistributedWorker, keeps the KV cache device-resident, and removes the old dedicated one-shot run_generate_l3() path.

The earlier prototype was PR #22.

The Simpler worker lifecycle issue previously referenced by this feature (hw-native-sys/simpler#980) has been closed.

PR #29 passed offline generation with larger ring settings:

task-submit --device auto --max-time 1200 --run \
  "cd /data/liuxu/pypto-serving && \
   PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
   SA_PROFILE_OUTPUT=offline_128_l3_worker_prefork_decode_buffers_trace.json \
   SA_PROFILE_LEVEL=kernel \
   python examples/model/qwen3_14b/npu_generate.py \
     --model-dir /data/linyifan/models/Qwen3-14B \
     --prompt 'Huawei is' \
     --platform a2a3 \
     --max-seq-len 512 \
     --max-new-tokens 128 \
     --profile \
     --device-id {}"
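For readers sizing the "larger ring settings" in the command above, the three values are exact powers of two. The sketch below only decodes the numbers; reading `PTO2_RING_HEAP` as a byte count is an assumption, since this issue does not state the units, and `ring_exponents` is a hypothetical helper.

```python
# Values copied from the task-submit command in this issue.
RING_ENV = {
    "PTO2_RING_HEAP": 4294967296,
    "PTO2_RING_TASK_WINDOW": 1048576,
    "PTO2_RING_DEP_POOL": 1048576,
}


def ring_exponents(env):
    """Map each setting to its power-of-two exponent, or None if not a power of two."""
    result = {}
    for name, value in env.items():
        exponent = value.bit_length() - 1
        result[name] = exponent if value == 1 << exponent else None
    return result


exponents = ring_exponents(RING_ENV)
# Assumption: the heap setting is measured in bytes.
heap_gib_if_bytes = RING_ENV["PTO2_RING_HEAP"] / 2**30
```

Under that assumption the heap is 4 GiB (2^32), and the task window and dependency pool each hold 2^20 entries.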

Observed output from the 128-token run:

text:  a Chinese multinational technology company, and the Huawei Mate 60 is one of its flagship smartphones. The Huawei Mate 60 is equipped with the Kirin 9000 chip, which is a system-on-chip (SoC) developed by Huawei's in-house semiconductor division. The Kirin 9000 is a 5nm process technology, 5G capable SoC that integrates the CPU, GPU, NPU, and other components into a single chip. The Kirin 900 is a powerful processor that offers excellent performance and efficiency for various tasks, including gaming, multimedia, and artificial intelligence
finish_reason: length

Timing from that run:

[perf] generated 128 tokens in 31.024s -> 4.13 tok/s (overall, incl. prefill)
[perf] prefill/TTFT 18.172s | decode 11.542s over 127 steps -> 11.00 tok/s (90.9 ms/token)
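The reported rates can be re-derived from the raw durations in the two `[perf]` lines. The sketch below uses only the numbers logged above; `derive_rates` is a hypothetical helper, not part of the repository.

```python
def derive_rates(total_tokens, total_s, prefill_s, decode_s, decode_steps):
    """Recompute throughput figures from the raw [perf] durations."""
    return {
        "overall_tok_s": total_tokens / total_s,
        "decode_tok_s": decode_steps / decode_s,
        "decode_ms_per_token": decode_s / decode_steps * 1000.0,
        # Wall time attributed to neither prefill nor decode.
        "unaccounted_s": total_s - prefill_s - decode_s,
        "prefill_share": prefill_s / total_s,
    }


rates = derive_rates(
    total_tokens=128,
    total_s=31.024,
    prefill_s=18.172,
    decode_s=11.542,
    decode_steps=127,
)
```

The recomputed values match the log after rounding (4.13 tok/s overall, 11.00 tok/s and 90.9 ms/token for decode). They also show that prefill accounts for roughly 59% of the wall time and that about 1.31 s is attributed to neither phase.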

Metadata

Labels

enhancement: New feature or request
