[Feature] Replace L2 kernel runtime and old L3 generate path with unified L3 worker dispatch

### Summary

Replace the current non-L3 compiled-kernel execution path, which uses `Worker(level=2)`, with an L3-worker based dispatch path using `Worker(level=3)` and `orch.submit_next_level(...)`.

As part of this migration, remove the old dedicated `run_generate_l3()` generation path. L3 execution should become the unified runtime mechanism for prefill/decode kernel dispatch instead of maintaining a separate one-shot L3 generation path.

### Area

Executor or runtime

### Motivation / Use Case

The current non-L3 path directly runs compiled chip callables through an L2 worker, while the repository also carries a separate one-shot L3 generation path. This creates two runtime models for generation:

- the normal prefill/decode path based on L2 worker dispatch
- the dedicated `run_generate_l3()` path with separate generated L3 artifacts and control flow

Newer Simpler/PyPTO runtime features, dependency tagging, child-memory handling, and future serving orchestration are centered around L3 worker execution. Moving the normal prefill/decode kernel path to L3 worker dispatch provides a migration bridge and removes the need to maintain a second L3-specific generation implementation.

This enables:

- serving to keep its existing prefill/decode scheduler semantics
- compiled `@pl.jit` prefill/decode kernels to run through the same hierarchical runtime model as future L3 serving
- tensor dependencies to be expressed through `TaskArgs` and `TensorArgType`
- one unified runtime path for offline generation and HTTP serving
- removal of duplicated one-shot L3 generation code once the L3-worker dispatch path covers the same workflows

Related: #18

Implementation PR: #29

### Blockers

The current implementation depends on PyPTO runtime/codegen fixes that are tracked as blockers:

- https://github.com/hw-native-sys/pypto/issues/1698 - Support multi-program dispatch for `DistributedCompiledProgram`
- https://github.com/hw-native-sys/pypto/issues/1707 - Distributed codegen misses `Submit` callees in next-level program extraction

### Proposed API / Behavior

For non-L3 compiled kernels:

- construct `Worker(level=3, device_ids=[device_id], num_sub_workers=0)` or use PyPTO's `DistributedWorker` wrapper for shared L3 dispatch
- register chip callables before worker initialization
- build `TaskArgs` with explicit tensor dependency tags
- submit chip callables inside an orchestration callback using `orch.submit_next_level(...)`
- keep KV cache device-resident across prefill and decode dispatches where supported by PyPTO runtime tensors
- pre-share CPU tensors before L3 worker initialization so child processes can access host mappings
- provide runtime wrapper methods for `submit_next_level`, orchestrator-scoped memory operations, and cleanup
- remove the old one-shot `run_generate_l3()` path and its dedicated generated-L3 artifacts from the serving runner once the L3-worker dispatch path covers offline and HTTP generation
- keep one runtime path for offline generation and serving, with feature flags only for rollout/debugging rather than permanent separate implementations
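
The ordering constraints in the list above can be sketched with stand-in classes. This is an illustration only, not the PyPTO or Simpler API: `SketchL3Worker`, `SketchOrchestrator`, `SketchTaskArgs`, `ArgTag`, and their methods are all hypothetical names invented for this example, and the real `Worker`, `TaskArgs`, `TensorArgType`, and `orch.submit_next_level(...)` signatures may differ. The sketch only records submissions; it does not execute kernels.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class ArgTag(Enum):
    """Hypothetical stand-in for TensorArgType dependency tags."""
    INPUT = auto()
    OUTPUT = auto()
    INOUT = auto()


@dataclass
class SketchTaskArgs:
    """Hypothetical stand-in for TaskArgs: tensor names with one tag each."""
    tensors: List[Tuple[str, ArgTag]] = field(default_factory=list)

    def add(self, name: str, tag: ArgTag) -> "SketchTaskArgs":
        self.tensors.append((name, tag))
        return self


class SketchOrchestrator:
    """Records submissions instead of dispatching to a device."""

    def __init__(self, callables: Dict[str, Callable]) -> None:
        self._callables = callables
        self.submitted: List[Tuple[str, List[str]]] = []

    def submit_next_level(self, name: str, args: SketchTaskArgs) -> None:
        if name not in self._callables:
            raise KeyError(f"chip callable {name!r} was never registered")
        self.submitted.append((name, [n for n, _ in args.tensors]))


class SketchL3Worker:
    """Enforces 'register before init' and runs an orchestration callback."""

    def __init__(self, device_ids: List[int]) -> None:
        self.device_ids = device_ids
        self._callables: Dict[str, Callable] = {}
        self._initialized = False

    def register(self, name: str, fn: Callable) -> None:
        if self._initialized:
            raise RuntimeError("register chip callables before init()")
        self._callables[name] = fn

    def init(self) -> None:
        self._initialized = True

    def run(self, orchestration: Callable[[SketchOrchestrator], None]):
        if not self._initialized:
            raise RuntimeError("call init() before run()")
        orch = SketchOrchestrator(self._callables)
        orchestration(orch)
        return orch.submitted


worker = SketchL3Worker(device_ids=[0])
worker.register("prefill", lambda: None)  # placeholder chip callables
worker.register("decode", lambda: None)
worker.init()


def generate_step(orch: SketchOrchestrator) -> None:
    # The KV cache is tagged INOUT in both submissions, so the decode
    # task depends on the prefill task that wrote it.
    prefill = (SketchTaskArgs()
               .add("input_ids", ArgTag.INPUT)
               .add("kv_cache", ArgTag.INOUT)
               .add("logits", ArgTag.OUTPUT))
    orch.submit_next_level("prefill", prefill)
    decode = (SketchTaskArgs()
              .add("last_token", ArgTag.INPUT)
              .add("kv_cache", ArgTag.INOUT)
              .add("logits", ArgTag.OUTPUT))
    orch.submit_next_level("decode", decode)


submitted = worker.run(generate_step)
```

Tagging the shared KV cache explicitly is what would let a runtime order prefill before decode without the caller synchronizing by hand. How the actual runtime derives dependencies from `TensorArgType` is not specified in this issue.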

The first implementation needed conservative worker-lifecycle validation while Simpler behavior was being debugged. The Simpler-side lifecycle issue was tracked in hw-native-sys/simpler#980 and has been closed.

### Alternatives Considered

Keep the current `Worker(level=2)` runtime for non-L3 kernels and only use L3 for the dedicated `run_generate_l3()` path.

That avoids L3 lifecycle complexity in the short term, but it keeps two separate runtime models in serving, duplicates generation control flow, and makes it harder to migrate scheduled prefill/decode execution into L3 DAG form.

Another alternative is to keep `run_generate_l3()` as a permanent fast path while adding L3-worker dispatch for serving. That would still carry a long-term maintenance cost: separate artifacts, separate sampling/prepare logic, separate KV handling, and separate correctness/performance validation.

### Additional Context

The current implementation is PR #29:

- https://github.com/hw-native-sys/pypto-serving/pull/29

PR #29 rewrites non-L3 Qwen3 prefill/decode dispatch through a shared PyPTO L3 `DistributedWorker`, keeps the KV cache device-resident, and removes the old dedicated one-shot `run_generate_l3()` path.

The earlier prototype was PR #22:

- https://github.com/hw-native-sys/pypto-serving/pull/22

The Simpler worker lifecycle issue previously referenced by this feature has been closed:

- https://github.com/hw-native-sys/simpler/issues/980

PR #29 passed offline generation with larger ring settings (`PTO2_RING_HEAP`, `PTO2_RING_TASK_WINDOW`, and `PTO2_RING_DEP_POOL` set explicitly):

```bash
task-submit --device auto --max-time 1200 --run \
  "cd /data/liuxu/pypto-serving && \
   PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=1048576 PTO2_RING_DEP_POOL=1048576 \
   SA_PROFILE_OUTPUT=offline_128_l3_worker_prefork_decode_buffers_trace.json \
   SA_PROFILE_LEVEL=kernel \
   python examples/model/qwen3_14b/npu_generate.py \
     --model-dir /data/linyifan/models/Qwen3-14B \
     --prompt 'Huawei is' \
     --platform a2a3 \
     --max-seq-len 512 \
     --max-new-tokens 128 \
     --profile \
     --device-id {}"
```

Observed output from the 128-token run:

```text
text:  a Chinese multinational technology company, and the Huawei Mate 60 is one of its flagship smartphones. The Huawei Mate 60 is equipped with the Kirin 9000 chip, which is a system-on-chip (SoC) developed by Huawei's in-house semiconductor division. The Kirin 9000 is a 5nm process technology, 5G capable SoC that integrates the CPU, GPU, NPU, and other components into a single chip. The Kirin 900 is a powerful processor that offers excellent performance and efficiency for various tasks, including gaming, multimedia, and artificial intelligence
finish_reason: length
```

Timing from that run:

```text
[perf] generated 128 tokens in 31.024s -> 4.13 tok/s (overall, incl. prefill)
[perf] prefill/TTFT 18.172s | decode 11.542s over 127 steps -> 11.00 tok/s (90.9 ms/token)
```
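
The derived figures in the log can be re-checked from its raw numbers. The snippet below is a small arithmetic check using only values from the timing lines above. The variable names are illustrative, and `other_s` is simply the part of the total not covered by the two reported phases, which the log does not break down.

```python
# Raw values copied from the [perf] lines above.
total_tokens = 128
total_s = 31.024
prefill_s = 18.172
decode_s = 11.542
decode_steps = 127

overall_tok_s = total_tokens / total_s          # overall throughput, incl. prefill
decode_tok_s = decode_steps / decode_s          # decode-only throughput
ms_per_token = decode_s / decode_steps * 1000   # mean decode latency per token
other_s = total_s - prefill_s - decode_s        # time outside prefill and decode
prefill_share = prefill_s / total_s             # fraction of wall time in prefill

print(f"overall: {overall_tok_s:.2f} tok/s")
print(f"decode:  {decode_tok_s:.2f} tok/s ({ms_per_token:.1f} ms/token)")
print(f"other:   {other_s:.3f} s")
print(f"prefill share of total: {prefill_share:.1%}")
```

The results match the logged 4.13 tok/s, 11.00 tok/s, and 90.9 ms/token. Prefill accounts for roughly 59% of the wall time in this run, with about 1.3 s falling outside both reported phases.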
