[Feature] Platform Management Design #32

### Summary

## Purpose

This document defines the platform-management design and its relationship with the existing PyPTO Serving layers.

The goal is to add a platform-management part to PyPTO Serving without changing the repository's core design principle: keep the shortest path from request scheduling to PyPTO/Simpler execution. Platform management should make the system more scalable and robust, but it should not become another model-execution abstraction layer.

## Context

PyPTO Serving currently focuses on a minimal local inference path:

```text
HTTP / CLI
  -> Async serving control plane
  -> Scheduler and KV cache manager
  -> Worker process
  -> PyPTO executor
  -> Simpler runtime
  -> Ascend NPU kernels
```

The platform proposal adds a separate layer for distributed-system management. It addresses the problems that appear when a local inference path becomes a multi-instance service:

- Workload spikes require dynamic resource scaling.
- Placement should be aware of latency, topology, and application-specific traffic patterns.
- Faults should be handled without restarting the whole service.
- The deployment should be reconfigurable so model partitions and kernels can move to better resources.

## Design Split

Serving should be split into two major submodules:

| Submodule | Owner | Responsibility |
| --- | --- | --- |
| Platform | VNL platform work | Start and scale the distributed system, create communication channels, manage replicas, monitor health, react to faults, and expose topology/resource changes to the model layer. |
| Model support | PyPTO Serving model layer | Handle LLM-specific behavior: request lifecycle, batching, model partition execution, KV cache policy, prefix reuse, token scheduling, sampling, and model-specific PyPTO executor logic. |

The platform submodule owns the process during bootstrap. After the initial deployment is ready, it hands request execution to the model-support submodule and becomes a passive management service. From that point on, it reacts to explicit model-support requests and health/resource events.

Examples of platform requests from the model-support layer:

- Spawn more replicas for a partition when demand exceeds capacity.
- Remove or drain replicas when demand drops.
- Create or delete data channels between partitions.
- Reconfigure model partition placement based on topology or resource availability.
- Start periodic services such as heartbeat, monitoring, and scaling decisions.
- Repair unhealthy instances and publish updated channel/endpoint metadata to the model layer.

## Target Architecture

```text
Client / API
  |
  v
Service entry layer
  |
  v
Async serving control plane
  |
  +-------------------- Platform management API --------------------+
  |                                                                 |
  |  Deployment manager                                             |
  |    - owns desired deployment state                              |
  |    - maps model partitions to instances                         |
  |    - asks runtime/cloud/Lingqu backend to spawn or remove nodes  |
  |                                                                 |
  |  Channel manager                                                |
  |    - creates device payload channels between model partitions    |
  |    - creates control channels between coordinators and replicas  |
  |                                                                 |
  |  Health and monitoring                                          |
  |    - heartbeat                                                  |
  |    - replica status                                             |
  |    - load and capacity signals                                  |
  |                                                                 |
  |  Scaling and placement policy                                   |
  |    - ramp up/down replicas                                      |
  |    - topology-aware placement                                   |
  |    - fault response and replacement                             |
  +-----------------------------------------------------------------+
  |
  v
Scheduler / KV cache / model support
  |
  v
Partition coordinators and replicas
  |
  v
PyPTO executor -> Simpler -> Ascend NPU
```

The platform-management API sits beside the current async serving control plane. It should not sit between the scheduler and executor for every token step. Token-level execution should remain on the current minimal path.

## Feature Overview

Platform management has several responsibilities. This section describes them at an overview level so that the first implementation can focus on one narrow feature without losing sight of the larger design direction.

### Host and Device Boundary

The platform layer must keep a strict boundary between host-side orchestration and device-side execution.

Host-side tasks are control tasks. A task running on the host should not execute model kernels or move tensors through host memory as the steady-state data path. Its responsibility is to launch, configure, supervise, and terminate a Simpler runtime instance for the assigned partition or replica.

Device-side runtime instances do the actual model work:

- The host task starts a Simpler runtime instance with the partition's model subset, device placement, and channel descriptors.
- PyPTO operators execute on the Ascend device through that Simpler runtime instance.
- Tensor payload channels between partitions should be created as device-side channels whenever they move activations, token tensors, logits, KV-related tensors, or other model data.
- Host-side channels should be limited to control-plane traffic: lifecycle commands, readiness, health, monitoring, placement decisions, and error reporting.
- Data should not bounce through host memory between model partitions unless there is an explicit fallback or debug mode.

This means a platform task is not the model executor itself. It is the host-side launcher and supervisor for a device-side executor.
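The boundary rules above can be checked mechanically before anything is launched. The following is a minimal sketch, assuming channels are described by a kind and a placement as in the deployment shape later in this document; the `Channel` class and `placement_violations` helper are hypothetical names, not existing repository code.

```python
from dataclasses import dataclass

# Channel kinds that carry model data (assumption: only "tensor" does).
TENSOR_KINDS = {"tensor"}


@dataclass(frozen=True)
class Channel:
    name: str
    kind: str        # "control", "health", or "tensor"
    placement: str   # "host" or "device"


def placement_violations(channels, allow_host_fallback=False):
    """Return the names of channels that break the host/device boundary."""
    bad = []
    for ch in channels:
        if ch.kind in TENSOR_KINDS:
            # Tensor payloads belong on the device; host placement is only
            # tolerated when an explicit fallback/debug mode is requested.
            if ch.placement != "device" and not allow_host_fallback:
                bad.append(ch.name)
        elif ch.placement != "host":
            # Control and health traffic stays on the host control plane.
            bad.append(ch.name)
    return bad


channels = [
    Channel("api_to_prefill", "control", "host"),
    Channel("prefill_to_decode", "tensor", "host"),  # violates the boundary
    Channel("decode_to_api", "control", "host"),
]
violations = placement_violations(channels)
```

With `allow_host_fallback=True` the same list passes, which mirrors the explicit fallback/debug exception described above.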

### Deployment Model

The platform layer should describe the service as a deployment graph:

- A `deployment` is the desired distributed application.
- A `partition` is a model or service stage that owns a subset of work.
- A `task` is the host-side launcher/supervisor for the partition's Simpler runtime instance.
- An `edge` is a data/control channel between partitions, with tensor payload channels placed on device when they carry model data.
- A `replica` is an instance that can execute a partition's work.
- A `coordinator` owns routing and load balancing for a partition.

This maps to the platform configuration model:

| Concept | Platform role | Meaning for PyPTO Serving |
| --- | --- | --- |
| Deployment | Top-level desired state | Whole serving application, including partitions, edges, request manager, heartbeat settings, and control-buffer settings. |
| Partition | Logical execution stage | A shard/stage of model execution, for example a pipeline stage, expert group, prefill stage, or decode stage. |
| Replica | Concrete runtime instance | A concrete instance assigned to run a partition. |
| Task | Host-side launcher | The control function that starts and supervises a partition-local Simpler runtime instance. |
| Edge | Communication link | A channel between producers and consumers. Tensor payload edges should be device-side; control edges can be host-side. |

The deployment model needs enough metadata to describe where host control tasks run, which Simpler runtime instance they launch, where model execution happens, and where channels are placed. The detailed API proposal for this is in the deep dive below.
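As an illustration of treating the deployment as a graph, the sketch below validates that every edge connects declared partitions and that every partition has at least one replica. The `validate_graph` helper and the `external` endpoint set (for non-partition endpoints such as an API front end) are assumptions made for this example.

```python
def validate_graph(partitions, edges, external=("api",)):
    """Check a deployment graph and return a list of error strings.

    partitions: dict mapping partition name -> replica count
    edges: list of (producer, consumer) name pairs
    external: endpoint names that are allowed without being partitions
    """
    errors = []
    known = set(partitions) | set(external)
    for name, replicas in partitions.items():
        if replicas < 1:
            errors.append(f"partition {name!r} needs at least one replica")
    for producer, consumer in edges:
        for endpoint in (producer, consumer):
            if endpoint not in known:
                errors.append(
                    f"edge {producer}->{consumer} references unknown endpoint {endpoint!r}"
                )
    return errors


partitions = {"prefill": 1, "decode": 1}
edges = [("api", "prefill"), ("prefill", "decode"), ("decode", "api")]
errors = validate_graph(partitions, edges)  # empty list: the graph is consistent
```

Running this kind of check before allocating instances keeps configuration mistakes from surfacing as partial deployments.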

### Coordinator and Replica Pattern

The platform uses coordinator/replica roles to reduce deployment complexity when scaling out a partition.

At a high level:

- The coordinator receives jobs from upstream partitions or the serving control plane.
- The coordinator tracks available replicas.
- The coordinator sends ready jobs to replicas over host-side control channels.
- A replica's host-side task launches or references the partition-local Simpler runtime instance.
- The Simpler runtime instance executes the partition-local model function on the device.
- Tensor outputs flow over device-side data channels when another model partition consumes them.
- The coordinator marks the replica available again and forwards completion/status metadata.

The coordinator should not inspect token-level model details or tensor payloads. It only sees requests, readiness, health, placement metadata, and channel metadata.
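The routing loop described above can be sketched with two queues. This is an illustrative model only: the `Coordinator` class is hypothetical, jobs are opaque identifiers, and real dispatch would go over host-side control channels rather than in-process calls.

```python
from collections import deque


class Coordinator:
    """Routes opaque jobs to idle replicas; it never inspects tensor payloads."""

    def __init__(self, replica_ids):
        self.idle = deque(replica_ids)  # replicas ready for work
        self.pending = deque()          # jobs waiting for a replica
        self.running = {}               # job_id -> replica_id

    def submit(self, job_id):
        self.pending.append(job_id)
        return self._dispatch()

    def complete(self, job_id):
        # Mark the replica available again, then hand out queued jobs.
        self.idle.append(self.running.pop(job_id))
        return self._dispatch()

    def _dispatch(self):
        assigned = []
        while self.pending and self.idle:
            job_id = self.pending.popleft()
            replica = self.idle.popleft()
            self.running[job_id] = replica
            assigned.append((job_id, replica))
        return assigned


coord = Coordinator(["r0", "r1"])
first = coord.submit("job-a")     # assigned to r0
second = coord.submit("job-b")    # assigned to r1
third = coord.submit("job-c")     # queued: no idle replica
resumed = coord.complete("job-a")  # r0 is free again and picks up job-c
```

The coordinator state contains only job identifiers and replica identifiers, which is consistent with keeping token-level details out of the platform layer.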

### Channel Management

The platform layer owns channel lifecycle. Model support should request channels by intent rather than constructing distributed transport directly.

Channel placement follows the host/device boundary:

- Tensor payload channels are device-side channels. They move activations, token tensors, logits, KV-related tensors, and partition outputs directly between device-side runtime instances.
- Control channels are host-side channels. They move lifecycle commands, job metadata, readiness, health, monitoring, and errors.
- Host-mediated tensor movement is a fallback/debug path, not the normal serving path.

The channel controller should follow a desired-state reconciliation model: register desired producers and consumers, compare desired channels against actual channels, create missing channels through the selected backend's resource-exchange mechanism, and remove stale channels when they are no longer desired.
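The reconciliation step reduces to a set difference between desired and actual channels. A minimal sketch follows, assuming each channel is identified by a `(producer, consumer, kind)` tuple; that key format and the `reconcile` helper are illustrative choices, not part of the proposal.

```python
def reconcile(desired, actual):
    """Compare desired and actual channel keys.

    Returns (to_create, to_delete), each a sorted list of
    (producer, consumer, kind) tuples.
    """
    desired, actual = set(desired), set(actual)
    to_create = sorted(desired - actual)  # wanted but missing
    to_delete = sorted(actual - desired)  # present but stale
    return to_create, to_delete


desired = {
    ("api", "prefill", "control"),
    ("prefill", "decode", "tensor"),
}
actual = {
    ("api", "prefill", "control"),
    ("old_stage", "decode", "tensor"),
}
to_create, to_delete = reconcile(desired, actual)
```

A controller would then create each entry in `to_create` through the selected backend and tear down each entry in `to_delete`. Because the comparison is idempotent, the loop can be re-run after any partial failure.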

### Health and Monitoring

Reliability and zero-downtime fault response are core platform goals. Heartbeat provides liveness, while monitoring provides load and capacity signals.

Recovery routing is owned by the platform:

- If a replica becomes unhealthy, the coordinator stops assigning it new jobs.
- The platform computes the replacement placement, requests a new instance if needed, and rebuilds the required host control channels and device tensor channels.
- The platform updates model support with new channel handles, endpoint descriptors, or partition routing metadata.
- Model support may pause, fail, or replay affected in-flight work according to model semantics, but it should not choose the replacement instance or reroute topology.

Useful monitoring signals include queue depth, pending requests, decode-token throughput, prefill latency, NPU memory use, KV-cache pressure, and channel backpressure.
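Heartbeat-based liveness can be sketched as a timestamp table. The example assumes a replica is considered unhealthy once more than `tolerance_ms` has passed since its last heartbeat; the proposal does not define the exact semantics of the tolerance, and `HeartbeatTracker` is a hypothetical name.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per replica and flags silent ones."""

    def __init__(self, tolerance_ms):
        self.tolerance_ms = tolerance_ms
        self.last_seen_ms = {}

    def beat(self, replica_id, now_ms):
        self.last_seen_ms[replica_id] = now_ms

    def unhealthy(self, now_ms):
        # A replica is unhealthy when its last heartbeat is older than the tolerance.
        return sorted(
            replica
            for replica, seen in self.last_seen_ms.items()
            if now_ms - seen > self.tolerance_ms
        )


tracker = HeartbeatTracker(tolerance_ms=3000)
tracker.beat("decode-0", now_ms=0)
tracker.beat("decode-1", now_ms=0)
tracker.beat("decode-0", now_ms=2000)   # decode-1 goes silent after t=0
silent = tracker.unhealthy(now_ms=3500)
```

Timestamps are passed in explicitly so the logic stays deterministic and testable. A coordinator would stop assigning jobs to every replica returned by `unhealthy()`.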

### Scaling and Placement

Dynamic scaling is a later capability built on the same deployment and channel metadata.

Scale-up can be triggered by queue depth, missed latency targets, token-throughput saturation, KV-cache pressure, or capacity lost to failed replicas. Scale-down should be drain-based: mark a replica as draining, stop assigning it new jobs, wait for in-flight work to finish or for a timeout to expire, unregister it from the coordinator, delete its channels, and release backend resources.

Topology-aware placement should consider NPU topology, device memory, channel locality, partition type, and KV-cache locality. The model layer can provide pressure and locality hints, but the platform owns placement decisions.
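The drain-based scale-down sequence can be modeled as a small state machine. This is a sketch under stated assumptions: the state names, the `Replica` class, and the timeout handling are hypothetical, and the cleanup actions are only indicated by a comment.

```python
from enum import Enum


class ReplicaState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    REMOVED = "removed"


class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.state = ReplicaState.ACTIVE
        self.in_flight = 0
        self.drain_started_ms = None

    def accepts_jobs(self):
        # Draining and removed replicas never receive new jobs.
        return self.state is ReplicaState.ACTIVE

    def start_drain(self, now_ms):
        if self.state is ReplicaState.ACTIVE:
            self.state = ReplicaState.DRAINING
            self.drain_started_ms = now_ms

    def try_remove(self, now_ms, timeout_ms):
        """Finish the drain once in-flight work is done or the timeout expires."""
        if self.state is not ReplicaState.DRAINING:
            return False
        timed_out = now_ms - self.drain_started_ms >= timeout_ms
        if self.in_flight == 0 or timed_out:
            # A real platform would unregister the replica from its coordinator,
            # delete its channels, and release backend resources here.
            self.state = ReplicaState.REMOVED
            return True
        return False


replica = Replica("decode-1")
replica.in_flight = 2
replica.start_drain(now_ms=0)
still_busy = replica.try_remove(now_ms=100, timeout_ms=5000)  # work still in flight
replica.in_flight = 0
removed = replica.try_remove(now_ms=200, timeout_ms=5000)     # drained cleanly
```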

### Supported Parallelism

The platform should be able to describe and place several forms of parallelism:

- Pipeline parallelism: each pipeline stage is a partition connected by edges.
- Expert parallelism: expert subsets can be placed as partitions and selected through model-support routing logic.
- Batch parallelism: whole-model or stage replicas can be scaled and load-balanced by coordinators.
- Tensor parallelism: model support owns tensor-level execution details; platform provides placement and channels for tensor-parallel groups.

The boundary is important: platform describes placement, channels, and replica lifecycle; PyPTO Serving model support describes how a partition computes.

## Deep Dive: Static Deployment API

The first implementation target should be static deployment description and startup. This is the smallest useful feature because it defines the host/device boundary and the API shape without requiring dynamic scaling, failure replacement, or topology-aware reconfiguration.

This deep dive is a proposed API shape, not an existing schema in the repository.

### Goals

- Describe partitions, replicas, host tasks, Simpler runtime instances, and channels.
- Make host/device placement explicit.
- Start a fixed deployment from a desired-state spec.
- Launch one Simpler runtime instance per replica task.
- Create host control channels and device tensor channels.
- Publish runtime and channel descriptors to model support.

### Non-Goals for the First API

- No automatic scale-up or scale-down.
- No automatic fault replacement.
- No topology optimizer.
- No model-layer rerouting policy.
- No host-side tensor data path except explicit fallback/debug modes.

### Proposed Deployment Shape

A first proposed deployment shape can stay minimal:

```json
{
  "name": "qwen3-14b-serving",
  "partitions": [
    {
      "name": "prefill",
      "model_range": {"layers": [0, 47]},
      "task_placement": "host",
      "runtime": "simpler",
      "execution_placement": "device",
      "parallelism": "batch",
      "replicas": 1
    },
    {
      "name": "decode",
      "model_range": {"layers": [0, 47]},
      "task_placement": "host",
      "runtime": "simpler",
      "execution_placement": "device",
      "parallelism": "batch",
      "replicas": 1
    }
  ],
  "channels": [
    {"name": "api_to_prefill", "producer": "api", "consumer": "prefill", "placement": "host", "kind": "control"},
    {"name": "prefill_to_decode", "producer": "prefill", "consumer": "decode", "placement": "device", "kind": "tensor"},
    {"name": "decode_to_api", "producer": "decode", "consumer": "api", "placement": "host", "kind": "control"}
  ],
  "heartbeat": {"enabled": true, "interval_ms": 1000, "tolerance_ms": 3000}
}
```

The platform API should eventually support richer partition metadata for pipeline parallelism, expert parallelism, batch parallelism, tensor parallelism, KV-cache locality, and NPU topology.
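To show how a spec of this shape could drive startup, the sketch below expands each partition's `replicas` count into one host task per replica. It uses a trimmed copy of the JSON with `decode` raised to two replicas for illustration; the `expand_replica_tasks` helper and the `<partition>-<index>` identifier format are assumptions.

```python
import json

# A trimmed copy of the deployment shape; only the fields used here are kept.
SPEC_JSON = """
{
  "name": "qwen3-14b-serving",
  "partitions": [
    {"name": "prefill", "replicas": 1},
    {"name": "decode", "replicas": 2}
  ]
}
"""


def expand_replica_tasks(spec):
    """Return one (partition, replica_id) pair per host task to launch."""
    tasks = []
    for partition in spec["partitions"]:
        for index in range(partition["replicas"]):
            # The "<partition>-<index>" id format is an assumption for this sketch.
            tasks.append((partition["name"], f"{partition['name']}-{index}"))
    return tasks


tasks = expand_replica_tasks(json.loads(SPEC_JSON))
```

Each resulting pair corresponds to one host-side task, which in turn launches one Simpler runtime instance.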

### Proposed Python API

The Python API can mirror the deployment shape with typed data classes. Exact field names can change, but the placement concepts should remain explicit.

```python
from dataclasses import dataclass
from typing import Literal


Placement = Literal["host", "device"]
ChannelKind = Literal["control", "health", "tensor"]


@dataclass(frozen=True)
class ModelRange:
    layers: tuple[int, int]


@dataclass(frozen=True)
class PartitionSpec:
    name: str
    model_range: ModelRange
    task_placement: Literal["host"]
    runtime: Literal["simpler"]
    execution_placement: Literal["device"]
    parallelism: str
    replicas: int


@dataclass(frozen=True)
class ChannelSpec:
    name: str
    producer: str
    consumer: str
    placement: Placement
    kind: ChannelKind
    capacity: int | None = None
    payload_size: int | None = None


@dataclass(frozen=True)
class HeartbeatSpec:
    enabled: bool
    interval_ms: int
    tolerance_ms: int


@dataclass(frozen=True)
class DeploymentSpec:
    name: str
    partitions: tuple[PartitionSpec, ...]
    channels: tuple[ChannelSpec, ...]
    heartbeat: HeartbeatSpec
```

### Platform Manager API

The Python-facing `PlatformManager` can be implemented as bindings over the existing C++ platform runtime, keeping Python as the serving/control API while C++ implements deployment, host task launch, channel creation, and runtime supervision.

The first platform manager API should focus on startup, shutdown, and publishing resolved descriptors:

```python
class PlatformManager:
    def start(self, deployment: DeploymentSpec) -> "RuntimePlan": ...
    def stop(self) -> None: ...
    def get_runtime_plan(self) -> "RuntimePlan": ...
```

`start()` should validate the deployment, launch host-side tasks, start Simpler runtime instances, create channels with the requested placement, and return a resolved runtime plan.

`RuntimePlan` is the model layer's read-only view of what the platform created:

```python
@dataclass(frozen=True)
class RuntimeEndpoint:
    partition: str
    replica_id: str
    host_task_id: str
    simpler_instance_id: str
    device_id: str


@dataclass(frozen=True)
class ChannelHandle:
    name: str
    producer: str
    consumer: str
    placement: Placement
    kind: ChannelKind
    handle: object


@dataclass(frozen=True)
class RuntimePlan:
    endpoints: tuple[RuntimeEndpoint, ...]
    channels: tuple[ChannelHandle, ...]
```

Model support should consume `RuntimePlan` but should not mutate it. If a later platform version replaces a replica or channel, the platform publishes a new plan or an incremental update.
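The read-only contract can be illustrated with frozen dataclasses: the model layer cannot mutate a plan, and the platform publishes a replacement instead. This sketch uses simplified stand-ins (`Endpoint`, `Plan`) rather than the full classes above, and the `version` field, the `replace_replica` helper, and the `npu:N` device id format are assumptions added for the example.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Endpoint:
    partition: str
    replica_id: str
    device_id: str


@dataclass(frozen=True)
class Plan:
    version: int       # hypothetical field so consumers can detect updates
    endpoints: tuple


def replace_replica(plan, old_replica_id, new_endpoint):
    """Build a new plan in which one replica is swapped out."""
    kept = tuple(e for e in plan.endpoints if e.replica_id != old_replica_id)
    return replace(plan, version=plan.version + 1, endpoints=kept + (new_endpoint,))


plan_v1 = Plan(1, (Endpoint("decode", "decode-0", "npu:0"),))
plan_v2 = replace_replica(plan_v1, "decode-0", Endpoint("decode", "decode-1", "npu:1"))

try:
    plan_v1.version = 99  # consumers cannot mutate a published plan
    mutated = True
except FrozenInstanceError:
    mutated = False
```

The original plan stays valid for any in-flight work that still holds a reference to it, while new work picks up the newer version.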

### Channel Creation API

For the static version, channel creation can be internal to `PlatformManager.start()`. If exposed, it should preserve placement:

```text
create_channel(source_partition, target_partition, channel_kind, placement, capacity, payload_size)
delete_channel(channel_id)
subscribe(channel_id, message_type, handler)
publish(channel_id, message)
```

Channel kinds and placement should distinguish:

- Device payload channels for activations, token batches, logits, KV-related movement, and partition-to-partition tensor flow.
- Host control channels for coordinator/replica commands and responses.
- Host health channels for heartbeat and status events.
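For host-side control and health channels, the four calls could behave like a small publish/subscribe registry. The in-memory stand-in below is only a sketch of the call semantics: `InMemoryChannels` is hypothetical, capacity and payload size are omitted, message types are modeled as Python classes, and device payload channels would need a backend-specific mechanism instead.

```python
import itertools
from collections import defaultdict


class InMemoryChannels:
    """In-process stand-in for host control channels."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._channels = {}                 # channel_id -> channel metadata
        self._handlers = defaultdict(list)  # (channel_id, message type) -> handlers

    def create_channel(self, source_partition, target_partition, channel_kind, placement):
        channel_id = next(self._ids)
        self._channels[channel_id] = (
            source_partition, target_partition, channel_kind, placement
        )
        return channel_id

    def delete_channel(self, channel_id):
        del self._channels[channel_id]
        for key in [k for k in self._handlers if k[0] == channel_id]:
            del self._handlers[key]

    def subscribe(self, channel_id, message_type, handler):
        if channel_id not in self._channels:
            raise KeyError(channel_id)
        self._handlers[(channel_id, message_type)].append(handler)

    def publish(self, channel_id, message):
        """Deliver a message to matching handlers; return how many ran."""
        if channel_id not in self._channels:
            raise KeyError(channel_id)
        handlers = self._handlers.get((channel_id, type(message)), [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = InMemoryChannels()
control = bus.create_channel("coordinator", "replica-0", "control", "host")
received = []
bus.subscribe(control, str, received.append)
delivered = bus.publish(control, "start")
```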

## Relation to PyPTO Serving Layers

### Service Entry Layer

Current files:

- `python/cli/main.py`

Expected platform relation:

- Add platform deployment configuration options.
- Choose local-only mode or distributed platform mode.
- Initialize the platform manager before starting model-serving traffic.

The default should remain local and minimal. Distributed platform mode should be explicit.
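One way to keep local mode the default is to make distributed mode an opt-in flag that also requires a deployment config. The sketch below uses `argparse`; the program name, the `--platform-mode` and `--deployment-config` flags, and the `parse_platform_args` helper are all hypothetical and do not exist in `python/cli/main.py` today.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pypto-serving")
    # Both flags are hypothetical; local mode stays the default.
    parser.add_argument(
        "--platform-mode", choices=["local", "distributed"], default="local"
    )
    parser.add_argument(
        "--deployment-config",
        default=None,
        help="path to a deployment spec (distributed mode only)",
    )
    return parser


def parse_platform_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.platform_mode == "distributed" and args.deployment_config is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--deployment-config is required with --platform-mode distributed")
    return args


default_args = parse_platform_args([])
distributed_args = parse_platform_args(
    ["--platform-mode", "distributed", "--deployment-config", "deploy.json"]
)
```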

### HTTP API Layer

Current files:

- `python/core/server.py`

Expected platform relation:

- Continue handling OpenAI-compatible requests and health endpoints.
- Expose high-level service health that includes platform status when enabled.
- Avoid exposing internal platform controls through the public inference API unless explicitly needed.

### Async Serving Control Plane

Current files:

- `python/core/async_engine.py`
- `python/core/scheduler.py`
- `python/core/kv_cache.py`

Expected platform relation:

- The async engine should call the platform manager for coarse-grained deployment changes, not for each token step.
- The scheduler should keep deciding request batching and token budgets.
- KV-cache policy should remain model support, but it may provide placement hints such as KV locality and memory pressure.
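To keep platform calls out of the per-token path, the engine can aggregate step-level metrics and forward them at a coarse interval. The sketch below assumes a callback-style reporting hook; `LoadReporter`, its parameters, and the metric name are hypothetical.

```python
class LoadReporter:
    """Accumulates per-step metrics and reports them at most once per interval."""

    def __init__(self, report_fn, min_interval_ms, start_ms=0):
        self.report_fn = report_fn
        self.min_interval_ms = min_interval_ms
        self._last_report_ms = start_ms
        self._tokens = 0

    def on_step(self, tokens, now_ms):
        """Call on every scheduler step; returns True when a report was sent."""
        self._tokens += tokens
        if now_ms - self._last_report_ms < self.min_interval_ms:
            return False  # stay on the hot path, no platform call
        self.report_fn({"tokens_since_last_report": self._tokens})
        self._tokens = 0
        self._last_report_ms = now_ms
        return True


reports = []
reporter = LoadReporter(reports.append, min_interval_ms=1000)
sent = [
    reporter.on_step(tokens=4, now_ms=10),
    reporter.on_step(tokens=4, now_ms=500),
    reporter.on_step(tokens=4, now_ms=1000),  # interval elapsed: one report
    reporter.on_step(tokens=4, now_ms=1200),
]
```

Four scheduler steps produce a single platform call, so the cost of reporting does not scale with token throughput.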

### Model Execution Layer

Current files:

- `python/core/serving_worker.py`
- `python/core/executor.py`
- `python/core/pypto_executor.py`
- `examples/model/qwen3_14b/runner/`

Expected platform relation:

- Workers become replicas when platform mode is enabled.
- A partition replica owns a model subset and a local Simpler/PyPTO executor on the assigned device.
- The host-side task for a replica starts and supervises the Simpler runtime instance.
- Model execution happens on the Ascend device, not in the host platform task.
- The platform manager starts and stops replicas, but does not own model execution logic.

### Backend Runtime Layer

Current files:

- `pypto-lib/`
- Simpler runtime integration

Expected platform relation:

- Simpler remains the runtime for NPU dispatch.
- Platform code should not introduce a competing model runtime.
- Platform backend adapters can target local processes, MPI-based launch, cloud-style emulation, or Lingqu infrastructure, but the model-execution path should still terminate in PyPTO/Simpler.
- The Python platform API should call into the C++ platform runtime through bindings rather than reimplementing platform orchestration in Python.
- Tensor data channels between model partitions should be created on the device side when they carry model data.
- Host runtime code should only orchestrate Simpler instances and control-plane channels.

## Static Bootstrap Flow

Initial startup should follow this sequence:

```text
parse serving and deployment config
  -> initialize platform backend
  -> validate deployment graph
  -> allocate initial instances
  -> start partition coordinators
  -> start initial replicas
  -> create host-side control channels
  -> launch Simpler runtime instances from host-side tasks
  -> create device-side tensor payload channels
  -> initialize heartbeat and monitoring on the host control plane
  -> initialize PyPTO model executors inside device-side Simpler runtime instances
  -> mark deployment ready
  -> start accepting model requests
```

This matches the issue discussion: the platform submodule takes ownership immediately upon launch and relinquishes request execution to model support once initialization is ready.
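The startup sequence can be expressed as an ordered list of steps that only reports readiness when every step succeeds. In this sketch the reverse-order rollback on failure is an assumption (the proposal does not specify partial-failure behavior), and the `bootstrap` helper and step names are hypothetical.

```python
def bootstrap(steps):
    """Run startup steps in order.

    steps: list of (name, start_fn, stop_fn) tuples.
    Returns (ready, log). On failure, already-started steps are stopped
    in reverse order and the deployment is not marked ready.
    """
    done, log = [], []
    for name, start, stop in steps:
        try:
            start()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            for prev_name, prev_stop in reversed(done):
                prev_stop()
                log.append(f"rolled back: {prev_name}")
            return False, log
        done.append((name, stop))
        log.append(f"started: {name}")
    log.append("deployment ready")
    return True, log


def noop():
    pass


def fail_launch():
    raise RuntimeError("device unavailable")


ok_ready, ok_log = bootstrap([
    ("control channels", noop, noop),
    ("simpler instances", noop, noop),
    ("tensor channels", noop, noop),
])
bad_ready, bad_log = bootstrap([
    ("control channels", noop, noop),
    ("simpler instances", fail_launch, noop),
])
```

Model requests would only be accepted after `bootstrap` returns `True`, which matches the "mark deployment ready" step in the flow above.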

## Future Platform APIs

Later platform work can extend the static API with coarse-grained management calls. These should remain outside the per-token hot path.

Potential future calls:

- `spawn_replica(partition, resources)` for dynamic scale-up or failure replacement.
- `drain_replica(replica)` for controlled scale-down.
- `remove_replica(replica)` after drain completes.
- `get_topology()` for topology-aware placement.
- `get_health()` for liveness snapshots.
- `report_load(metrics)` for scheduler and worker load signals.

The platform owns these decisions and publishes updated `RuntimePlan` data to model support after changes are applied.

## Future Failure Handling

Failure handling should build on the static deployment API. When a replica fails, the platform should mark it unhealthy, stop assigning work to it, create replacement capacity, rebuild host control channels and device tensor channels, and publish an updated `RuntimePlan` to model support. Model support can then resume, retry, or fail affected work according to model semantics.

Coordinator failure requires stronger state handling and is a later milestone. A minimal first implementation can treat coordinator failure as partition-level failure and restart the partition's coordinator and replicas.

## Non-Goals

- Do not turn PyPTO Serving into a full production serving framework.
- Do not put platform calls in the per-token execution hot path.
- Do not duplicate model-specific scheduling, KV-cache, or sampling policy inside the platform layer.
- Do not introduce a second NPU execution runtime alongside Simpler.
- Do not require distributed platform mode for the current single-node reference path.

## References

- GitHub issue: <https://github.com/hw-native-sys/pypto-serving/issues/13>
- PyPTO Serving architecture wiki: <https://github.com/hw-native-sys/pypto-serving/wiki/PyPTO-Serving-Architecture-Overview>
- PyPTO Serving design philosophy wiki: <https://github.com/hw-native-sys/pypto-serving/wiki/PyPTO-Serving-%E2%80%94-Design-Philosophy>


### Area

Executor or runtime

### Motivation / Use Case

Enhance PyPTO Serving to support distributed execution.

### Proposed API / Behavior

_No response_

### Alternatives Considered

_No response_

### Additional Context

_No response_

