[Feature] Parallel Strategies Support

### Summary

Define a first-class parallel strategy model for PyPTO Serving, covering data parallelism (DP), tensor parallelism (TP), expert parallelism (EP), pipeline parallelism (PP), and hybrid layouts such as DP+TP, TP+EP, and DP+TP+EP.

The design should follow the same high-level separation used by vLLM: one config object describes parallel dimensions, the runtime resolves rank groups/topology from it, and model runners consume group metadata without owning request routing or platform placement.


### Area

Executor or runtime

### Motivation / Use Case

PyPTO Serving needs a clear contract for multi-device and distributed serving. Without an explicit parallel strategy model, scheduler behavior, KV-cache ownership, PyPTO executor setup, worker placement, and communication channels can become model-specific and hard to compose.

  vLLM provides a useful reference:

  - `ParallelConfig` owns `pipeline_parallel_size`, `tensor_parallel_size`, `data_parallel_size`, DP rank/backend/LB mode, `enable_expert_parallel`, expert placement, and all2all backend choices.
  - Runtime code derives TP, PP, DP, and EP groups from the configured world layout.
  - DP is represented as independent engine/scheduler replicas, with a coordinator/load-balancing layer when needed.
  - TP/PP are model-execution dimensions inside one logical model replica.
  - EP is MoE-specific and depends on expert placement plus all2all communication backend selection.

  PyPTO Serving should define an equivalent serving-facing contract adapted to PyPTO/Simpler/L3 execution.


### Proposed API / Behavior

Add a parallel strategy section to serving config and model registration.

  Example shape:

```json
{
  "parallel": {
    "data_parallel_size": 2,
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1,
    "enable_expert_parallel": true,
    "expert_placement_strategy": "linear",
    "all2all_backend": "pypto_default",
    "data_parallel_backend": "local",
    "data_parallel_load_balance": "internal"
  }
}
```
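As a rough illustration of how the `parallel` section could be loaded and validated, here is a minimal Python sketch. The `ParallelConfig` class, its `from_dict` helper, and the `ranks_per_replica` / `world_size` properties are hypothetical names chosen for this example, not an existing PyPTO Serving API. The field names and defaults simply mirror the JSON example, and the sketch assumes the world size is the product of the DP, TP, and PP sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical schema mirroring the JSON example above (assumption, not a real API)."""

    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    enable_expert_parallel: bool = False
    expert_placement_strategy: str = "linear"
    all2all_backend: str = "pypto_default"
    data_parallel_backend: str = "local"
    data_parallel_load_balance: str = "internal"

    def __post_init__(self) -> None:
        # Reject non-positive sizes early so later group resolution can rely on them.
        for name in ("data_parallel_size", "tensor_parallel_size", "pipeline_parallel_size"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

    @property
    def ranks_per_replica(self) -> int:
        # Device ranks inside one logical model replica (TP x PP).
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def world_size(self) -> int:
        # Assumes every DP replica uses the same TP x PP layout.
        return self.data_parallel_size * self.ranks_per_replica

    @classmethod
    def from_dict(cls, serving_config: dict) -> "ParallelConfig":
        # Unknown keys raise TypeError, which surfaces typos in the config.
        return cls(**serving_config.get("parallel", {}))
```

With the example config (`data_parallel_size: 2`, `tensor_parallel_size: 4`, `pipeline_parallel_size: 1`), this sketch would report 4 ranks per replica and a world size of 8.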

  Expected behavior:

  - DP creates multiple logical serving replicas. Each DP rank owns request scheduling, KV-cache state, prefix-cache metadata, and one executor group.
  - TP is one logical model executor split across multiple device ranks. The scheduler sees one logical worker, while the executor/model runner handles tensor shards and collectives.
  - PP splits model layers into ordered stages. Runtime/platform owns stage placement and activation channels; scheduler still owns request lifecycle.
  - EP is enabled only for MoE models. Model support owns token-to-expert routing semantics, while runtime/platform owns expert placement and all2all channels.
  - Hybrid layouts compose dimensions explicitly, for example DP replicas where each replica is a TP group.
  - Runtime should expose resolved group metadata to model runners, similar to vLLM’s TP/PP/DP/EP group accessors.
  - Platform code should own placement, process lifecycle, and channel creation, but should not own token scheduling or model-specific routing.
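The group resolution described above could look roughly like the following sketch. It assumes a flat rank order of DP x PP x TP with TP as the innermost dimension; that ordering is an assumption for illustration, and PyPTO Serving may choose a different one. `RankGroups` and `resolve_groups` are hypothetical names. EP groups are left out because they depend on the expert placement strategy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RankGroups:
    """Resolved group membership for one global rank (hypothetical structure)."""

    tp_group: Tuple[int, ...]
    pp_group: Tuple[int, ...]
    dp_group: Tuple[int, ...]


def resolve_groups(rank: int, dp: int, pp: int, tp: int) -> RankGroups:
    """Derive the TP/PP/DP peers of `rank`, assuming DP x PP x TP order with TP innermost."""
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")

    # Decompose the flat rank into (dp_idx, pp_idx, tp_idx) coordinates.
    dp_idx, rest = divmod(rank, pp * tp)
    pp_idx, tp_idx = divmod(rest, tp)

    def flat(d: int, p: int, t: int) -> int:
        # Inverse of the decomposition above.
        return (d * pp + p) * tp + t

    return RankGroups(
        # Peers that differ only in the TP coordinate.
        tp_group=tuple(flat(dp_idx, pp_idx, t) for t in range(tp)),
        # Peers that differ only in the PP stage.
        pp_group=tuple(flat(dp_idx, p, tp_idx) for p in range(pp)),
        # Peers holding the same shard in other DP replicas.
        dp_group=tuple(flat(d, pp_idx, tp_idx) for d in range(dp)),
    )
```

For the DP=2, TP=4, PP=1 layout from the example config, rank 5 would belong to TP group `(4, 5, 6, 7)`, PP group `(5,)`, and DP group `(1, 5)` under this ordering. A model runner would consume only this metadata and would not need to know how ranks were placed on devices.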

  Acceptance criteria:

  - A `ParallelConfig` or equivalent schema exists.
  - DP, TP, PP, EP, and hybrid ownership boundaries are documented.
  - Scheduler can route to logical workers without knowing tensor/expert sharding details.
  - Executor can initialize rank groups and pass group metadata into PyPTO model runners.
  - KV-cache and prefix-cache ownership are specified for DP, TP, PP, and EP.
  - Initial implementation may support only one mode, but the schema should not block later hybrid strategies.
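To make the "route to logical workers" criterion concrete, here is a small sketch of a DP-level router that only sees replica indices and never touches tensor or expert sharding. The `ReplicaRouter` class and its least-in-flight policy are assumptions for illustration; the issue does not prescribe a load-balancing policy.

```python
class ReplicaRouter:
    """Hypothetical DP router: picks the replica with the fewest in-flight requests."""

    def __init__(self, num_replicas: int) -> None:
        if num_replicas < 1:
            raise ValueError("num_replicas must be at least 1")
        # One counter per DP replica; each replica is a single logical worker
        # from the router's point of view, regardless of its TP/PP/EP layout.
        self._in_flight = [0] * num_replicas

    def assign(self) -> int:
        # Ties resolve to the lowest replica index because min() keeps the first minimum.
        replica = min(range(len(self._in_flight)), key=self._in_flight.__getitem__)
        self._in_flight[replica] += 1
        return replica

    def complete(self, replica: int) -> None:
        # Called when a request finishes on the given replica.
        if self._in_flight[replica] == 0:
            raise ValueError(f"replica {replica} has no in-flight requests")
        self._in_flight[replica] -= 1
```

Because the router tracks only per-replica counters, swapping a replica from a single device to a TP group would not change this code, which is the separation the acceptance criteria ask for.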


### Alternatives Considered

Keep adding multi-device behavior directly inside model runners. This is simple initially, but it makes request routing, KV ownership, profiling, and platform placement hard to reuse.

Put all distributed behavior into the platform layer. This is too coarse: platform should manage lifecycle and placement, while model support owns scheduling, KV policy, sampling, and model-specific parallel execution.


### Additional Context

Related pypto-serving issues:

  - #18: L3 serving runtime design
  - #26: Replace L2 kernel runtime and old L3 generate path with unified L3 worker dispatch
  - #32: Platform management design

