Summary
Replace the current non-L3 compiled-kernel execution path, which uses Worker(level=2), with an L3-worker-based dispatch path that uses Worker(level=3) and orch.submit_next_level(...).
As part of this migration, remove the old dedicated run_generate_l3() generation path. L3 execution should become the unified runtime mechanism for prefill/decode kernel dispatch instead of maintaining a separate one-shot L3 generation path.
Area
Executor or runtime
Motivation / Use Case
The current non-L3 path directly runs compiled chip callables through an L2 worker, while the repository also carries a separate one-shot L3 generation path. This creates two runtime models for generation:
- the normal prefill/decode path based on L2 worker dispatch
- the dedicated run_generate_l3() path with separate generated L3 artifacts and control flow
Newer Simpler/PyPTO runtime features, dependency tagging, child-memory handling, and future serving orchestration are centered around L3 worker execution. Moving the normal prefill/decode kernel path to L3 worker dispatch provides a migration bridge and removes the need to maintain a second L3-specific generation implementation.
This enables:
- serving to keep its existing prefill/decode scheduler semantics
- compiled @pl.jit prefill/decode kernels to run through the same hierarchical runtime model as future L3 serving
- tensor dependencies to be expressed through TaskArgs and TensorArgType
- one unified runtime path for offline generation and HTTP serving
- removal of duplicated one-shot L3 generation code once the L3-worker dispatch path covers the same workflows
Related: #18
Implementation PR: #29
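To illustrate why expressing tensor dependencies through tags matters for the list above, here is a minimal sketch of how read/write tags on task arguments can imply the prefill-to-decode ordering. Everything in it is an assumption for illustration: `DepTag`, `Task`, and `derive_edges` are hypothetical stand-ins and are not the real TaskArgs or TensorArgType API, whose members and semantics may differ.

```python
from dataclasses import dataclass
from enum import Enum, auto


class DepTag(Enum):
    """Hypothetical stand-in for the role a tensor dependency tag plays."""
    INPUT = auto()   # task only reads the tensor
    OUTPUT = auto()  # task overwrites the tensor
    INOUT = auto()   # task reads and then updates the tensor


@dataclass
class Task:
    name: str
    args: dict[str, DepTag]  # tensor name -> dependency tag


def derive_edges(tasks: list[Task]) -> list[tuple[str, str]]:
    """Return (producer, consumer) edges implied by the tags, in submit order.

    Only read-after-write edges are tracked; a real runtime would also have
    to order write-after-write and write-after-read hazards.
    """
    last_writer: dict[str, str] = {}
    edges: list[tuple[str, str]] = []
    for task in tasks:
        for tensor, tag in task.args.items():
            reads = tag in (DepTag.INPUT, DepTag.INOUT)
            if reads and tensor in last_writer:
                edge = (last_writer[tensor], task.name)
                if edge not in edges:
                    edges.append(edge)
        # Record writes after scanning reads so a task never depends on itself.
        for tensor, tag in task.args.items():
            if tag in (DepTag.OUTPUT, DepTag.INOUT):
                last_writer[tensor] = task.name
    return edges


tasks = [
    Task("prefill", {"tokens": DepTag.INPUT, "kv_cache": DepTag.OUTPUT}),
    Task("decode_0", {"kv_cache": DepTag.INOUT, "logits": DepTag.OUTPUT}),
    Task("decode_1", {"kv_cache": DepTag.INOUT, "logits": DepTag.OUTPUT}),
]
edges = derive_edges(tasks)
print(edges)
```

In this toy model the KV cache is the only tensor that links the tasks, so the derived chain is prefill, then each decode step in order, without the scheduler having to encode that order by hand.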
Blockers
The current implementation depends on PyPTO runtime/codegen fixes that are tracked as blockers:
- ... DistributedCompiledProgram ... Submit callees in next-level program extraction
Proposed API / Behavior
For non-L3 compiled kernels:
- construct Worker(level=3, device_ids=[device_id], num_sub_workers=0) or use PyPTO's DistributedWorker wrapper for shared L3 dispatch
- register chip callables before worker initialization
- build TaskArgs with explicit tensor dependency tags
- submit chip callables inside an orchestration callback using orch.submit_next_level(...)
- keep KV cache device-resident across prefill and decode dispatches where supported by PyPTO runtime tensors
- pre-share CPU tensors before L3 worker initialization so child processes can access host mappings
- provide runtime wrapper methods for submit_next_level, orchestrator-scoped memory operations, and cleanup
- remove the old one-shot run_generate_l3() path and its dedicated generated-L3 artifacts from the serving runner once the L3-worker dispatch path covers offline and HTTP generation
- keep one runtime path for offline generation and serving, with feature flags only for rollout/debugging rather than permanent separate implementations
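The ordering constraints in the list above (register before initialization, submit only inside an orchestration callback, keep the KV cache alive across dispatches, always clean up) can be sketched with plain-Python stubs. This is a rough illustration under stated assumptions, not the PyPTO or Simpler API: `StubL3Worker`, `StubOrchestrator`, and the `register`, `init`, `run`, and `close` methods are hypothetical, and the `submit_next_level(name, task_args)` signature is assumed.

```python
class StubOrchestrator:
    """Hypothetical stand-in for the `orch` object handed to the callback."""

    def __init__(self, callables):
        self._callables = callables
        self.log = []

    def submit_next_level(self, name, task_args):
        # Assumed signature: a registered callable name plus its arguments.
        self.log.append(name)
        return self._callables[name](**task_args)


class StubL3Worker:
    """Hypothetical stand-in for Worker(level=3, ...); not the real class."""

    def __init__(self, level, device_ids, num_sub_workers=0):
        self.level = level
        self.device_ids = device_ids
        self.num_sub_workers = num_sub_workers
        self._callables = {}
        self._initialized = False

    def register(self, name, fn):
        # Mirrors the rule that chip callables are registered before init.
        if self._initialized:
            raise RuntimeError("register chip callables before init()")
        self._callables[name] = fn

    def init(self):
        self._initialized = True

    def run(self, orchestration):
        if not self._initialized:
            raise RuntimeError("call init() before run()")
        orch = StubOrchestrator(self._callables)
        orchestration(orch)  # submissions happen only inside the callback
        return orch.log

    def close(self):
        self._initialized = False


kv_cache = []  # stands in for a device-resident KV cache


def prefill(tokens, kv):
    kv.extend(tokens)
    return len(kv)


def decode(token, kv):
    kv.append(token)
    return len(kv)


worker = StubL3Worker(level=3, device_ids=[0], num_sub_workers=0)
worker.register("prefill", prefill)
worker.register("decode", decode)
worker.init()


def generate(orch):
    orch.submit_next_level("prefill", {"tokens": [1, 2, 3], "kv": kv_cache})
    for token in (4, 5):
        orch.submit_next_level("decode", {"token": token, "kv": kv_cache})


try:
    dispatch_log = worker.run(generate)
finally:
    worker.close()  # cleanup runs even if a dispatch fails

print(dispatch_log, kv_cache)
```

The same `kv_cache` object is passed to every dispatch, which is the stub's analogue of keeping the cache resident between prefill and decode instead of copying it per call.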
The first implementation needed conservative worker-lifecycle validation while Simpler behavior was being debugged. The Simpler-side lifecycle issue was tracked in hw-native-sys/simpler#980 and has been closed.
Alternatives Considered
Keep the current Worker(level=2) runtime for non-L3 kernels and only use L3 for the dedicated run_generate_l3() path.
That avoids L3 lifecycle complexity in the short term, but it keeps two separate runtime models in serving, duplicates generation control flow, and makes it harder to migrate scheduled prefill/decode execution into L3 DAG form.
Another alternative is to keep run_generate_l3() as a permanent fast path while adding L3-worker dispatch for serving. That would still carry a long-term maintenance cost: separate artifacts, separate sampling/prepare logic, separate KV handling, and separate correctness/performance validation.
Additional Context
The current implementation is PR #29, which rewrites non-L3 Qwen3 prefill/decode dispatch through a shared PyPTO L3 DistributedWorker, keeps the KV cache device-resident, and removes the old dedicated one-shot run_generate_l3() path.
The earlier prototype was PR #22.
The Simpler worker lifecycle issue previously referenced by this feature, hw-native-sys/simpler#980, has been closed.
PR #29 passed offline generation with larger ring settings:
task-submit --device auto --max-time 1200 --run \
"cd /data/liuxu/pypto-serving && \
PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
SA_PROFILE_OUTPUT=offline_128_l3_worker_prefork_decode_buffers_trace.json \
SA_PROFILE_LEVEL=kernel \
python examples/model/qwen3_14b/npu_generate.py \
--model-dir /data/linyifan/models/Qwen3-14B \
--prompt 'Huawei is' \
--platform a2a3 \
--max-seq-len 512 \
--max-new-tokens 128 \
--profile \
--device-id {}"
Observed output from the 128-token run:
text: a Chinese multinational technology company, and the Huawei Mate 60 is one of its flagship smartphones. The Huawei Mate 60 is equipped with the Kirin 9000 chip, which is a system-on-chip (SoC) developed by Huawei's in-house semiconductor division. The Kirin 9000 is a 5nm process technology, 5G capable SoC that integrates the CPU, GPU, NPU, and other components into a single chip. The Kirin 900 is a powerful processor that offers excellent performance and efficiency for various tasks, including gaming, multimedia, and artificial intelligence
finish_reason: length
Timing from that run:
[perf] generated 128 tokens in 31.024s -> 4.13 tok/s (overall, incl. prefill)
[perf] prefill/TTFT 18.172s | decode 11.542s over 127 steps -> 11.00 tok/s (90.9 ms/token)
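As a cross-check of the two [perf] lines, the reported rates can be recomputed from the raw timings. The snippet below uses only the numbers printed above; the variable names are illustrative.

```python
total_tokens, total_s = 128, 31.024
prefill_s = 18.172
decode_steps, decode_s = 127, 11.542

overall_tok_s = total_tokens / total_s                # overall rate, incl. prefill
decode_tok_s = decode_steps / decode_s                # steady-state decode rate
decode_ms_per_token = 1000 * decode_s / decode_steps  # per-step decode latency
other_s = total_s - prefill_s - decode_s              # time in neither phase
prefill_share = prefill_s / total_s                   # fraction of wall time

print(f"overall: {overall_tok_s:.2f} tok/s")
print(f"decode: {decode_tok_s:.2f} tok/s ({decode_ms_per_token:.1f} ms/token)")
print(f"unattributed: {other_s:.3f} s, prefill share: {prefill_share:.1%}")
```

The recomputed rates match the log (4.13 tok/s overall, 11.00 tok/s and 90.9 ms/token for decode). About 1.31 s of the 31.024 s total falls outside both the prefill and decode figures; the log does not say where that time goes. Prefill/TTFT accounts for more than half of the wall time in this run.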