[Feature] Parallel Strategies Support #36

Description

@ndleslx

Summary

Define a first-class parallel strategy model for PyPTO Serving, covering data parallelism (DP), tensor parallelism (TP), expert parallelism (EP), pipeline parallelism (PP), and hybrid layouts such as DP+TP, TP+EP, and DP+TP+EP.

The design should follow the same high-level separation used by vLLM: one config object describes parallel dimensions, the runtime resolves rank groups/topology from it, and model runners consume group metadata without owning request routing or platform placement.

Area

Executor or runtime

Motivation / Use Case

PyPTO Serving needs a clear contract for multi-device and distributed serving. Without an explicit parallel strategy model, scheduler behavior, KV-cache ownership, PyPTO executor setup, worker placement, and communication channels can become model-specific and hard to compose.

vLLM provides a useful reference:

  • ParallelConfig owns pipeline_parallel_size, tensor_parallel_size, data_parallel_size, the DP rank, DP backend, and DP load-balancing mode, enable_expert_parallel, expert placement, and all2all backend choices.
  • Runtime code derives TP, PP, DP, and EP groups from the configured world layout.
  • DP is represented as independent engine/scheduler replicas, with a coordinator/load-balancing layer when needed.
  • TP/PP are model-execution dimensions inside one logical model replica.
  • EP is MoE-specific and depends on expert placement plus all2all communication backend selection.

PyPTO Serving should define an equivalent serving-facing contract adapted to PyPTO/Simpler/L3 execution.

Proposed API / Behavior

Add a parallel strategy section to serving config and model registration.

Example shape:

{
  "parallel": {
    "data_parallel_size": 2,
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1,
    "enable_expert_parallel": true,
    "expert_placement_strategy": "linear",
    "all2all_backend": "pypto_default",
    "data_parallel_backend": "local",
    "data_parallel_load_balance": "internal"
  }
}
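
A minimal sketch of how the example shape above could be loaded and validated. The `ParallelConfig` class, `from_dict`, `replica_size`, and `world_size` here are hypothetical names for illustration, not an existing PyPTO Serving or vLLM API. The sketch assumes EP reuses the ranks already allocated by DP and TP rather than adding devices, so it does not enter the world-size product.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical schema mirroring the example JSON in this issue."""
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    enable_expert_parallel: bool = False
    expert_placement_strategy: str = "linear"
    all2all_backend: str = "pypto_default"
    data_parallel_backend: str = "local"
    data_parallel_load_balance: str = "internal"

    def __post_init__(self):
        # Every parallel dimension must be a positive integer.
        for name in ("data_parallel_size", "tensor_parallel_size",
                     "pipeline_parallel_size"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

    @property
    def replica_size(self) -> int:
        # Device ranks inside one logical model replica (TP x PP).
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def world_size(self) -> int:
        # Assumption: EP reuses existing ranks, so only DP, TP, and PP multiply.
        return self.data_parallel_size * self.replica_size

    @classmethod
    def from_dict(cls, raw: dict) -> "ParallelConfig":
        section = raw.get("parallel", {})
        known = {f.name for f in fields(cls)}
        unknown = set(section) - known
        if unknown:
            # Reject misspelled keys instead of silently ignoring them.
            raise ValueError(f"unknown parallel options: {sorted(unknown)}")
        return cls(**section)


raw = json.loads("""
{
  "parallel": {
    "data_parallel_size": 2,
    "tensor_parallel_size": 4,
    "pipeline_parallel_size": 1,
    "enable_expert_parallel": true
  }
}
""")
cfg = ParallelConfig.from_dict(raw)
print(cfg.replica_size, cfg.world_size)  # 4 8
```

Rejecting unknown keys is a design choice in this sketch: it keeps a typo such as `tensor_paralel_size` from silently falling back to the default of 1.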

Expected behavior:

  • DP creates multiple logical serving replicas. Each DP rank owns request scheduling, KV-cache state, prefix-cache metadata, and one executor group.
  • TP is one logical model executor split across multiple device ranks. Scheduler sees one worker, while the executor/model runner handles tensor shards and collectives.
  • PP splits model layers into ordered stages. Runtime/platform owns stage placement and activation channels; scheduler still owns request lifecycle.
  • EP is enabled only for MoE models. Model support owns token-to-expert routing semantics, while runtime/platform owns expert placement and all2all channels.
  • Hybrid layouts compose dimensions explicitly, for example DP replicas where each replica is a TP group.
  • Runtime should expose resolved group metadata to model runners, similar to vLLM’s TP/PP/DP/EP group accessors.
  • Platform code should own placement, process lifecycle, and channel creation, but should not own token scheduling or model-specific routing.
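
To make the "resolved group metadata" idea concrete, here is a rough sketch of deriving TP, PP, and DP groups from the three sizes. `RankGroups` and `resolve_groups` are hypothetical names, and the rank layout (TP innermost, then PP, then DP outermost) is an assumption for illustration; the actual PyPTO Serving layout may differ. EP groups are left out because their membership depends on the expert placement policy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RankGroups:
    """Hypothetical per-rank metadata a model runner could consume."""
    rank: int
    dp_rank: int
    pp_rank: int
    tp_rank: int
    tp_group: Tuple[int, ...]  # ranks sharing tensor shards of one stage
    pp_group: Tuple[int, ...]  # ranks forming the ordered stage chain
    dp_group: Tuple[int, ...]  # same position across DP replicas


def resolve_groups(dp: int, pp: int, tp: int) -> List[RankGroups]:
    # Assumed layout: TP varies fastest, DP slowest.
    def global_rank(d: int, p: int, t: int) -> int:
        return (d * pp + p) * tp + t

    resolved = []
    for d in range(dp):
        for p in range(pp):
            for t in range(tp):
                resolved.append(RankGroups(
                    rank=global_rank(d, p, t),
                    dp_rank=d,
                    pp_rank=p,
                    tp_rank=t,
                    tp_group=tuple(global_rank(d, p, i) for i in range(tp)),
                    pp_group=tuple(global_rank(d, i, t) for i in range(pp)),
                    dp_group=tuple(global_rank(i, p, t) for i in range(dp)),
                ))
    return resolved


# DP+TP hybrid: two replicas, each a TP group of four ranks.
groups = resolve_groups(dp=2, pp=1, tp=4)
me = groups[5]
print(me.dp_rank, me.tp_rank)  # 1 1
print(me.tp_group)             # (4, 5, 6, 7)
print(me.dp_group)             # (1, 5)
```

With this layout the model runner only reads its own `RankGroups` entry; it never needs to know how requests were routed to its DP replica or where the platform placed each rank.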

Acceptance criteria:

  • A ParallelConfig or equivalent schema exists.
  • DP, TP, PP, EP, and hybrid ownership boundaries are documented.
  • Scheduler can route to logical workers without knowing tensor/expert sharding details.
  • Executor can initialize rank groups and pass group metadata into PyPTO model runners.
  • KV-cache and prefix-cache ownership are specified for DP, TP, PP, and EP.
  • Initial implementation may support only one mode, but the schema should not block later hybrid strategies.
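
The routing criterion can be illustrated with a small sketch in which the DP-level load-balancing layer sees only logical workers. `LogicalWorker` and `ReplicaRouter` are hypothetical names, and least-in-flight selection is just one possible policy, not a proposed default. Assignments are sticky because each DP rank owns the KV-cache state for its requests.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class LogicalWorker:
    """One DP replica as seen by the router (hypothetical)."""
    dp_rank: int
    device_ranks: Tuple[int, ...]  # opaque here; only the executor uses it
    in_flight: int = 0


class ReplicaRouter:
    """Routes requests to DP replicas without inspecting TP/EP sharding."""

    def __init__(self, workers: Iterable[LogicalWorker]):
        self._workers: Dict[int, LogicalWorker] = {w.dp_rank: w for w in workers}
        if not self._workers:
            raise ValueError("at least one logical worker is required")
        self._owner: Dict[str, int] = {}

    def assign(self, request_id: str) -> int:
        # Sticky: a request stays on the replica that holds its KV cache.
        if request_id in self._owner:
            return self._owner[request_id]
        # Least in-flight requests, ties broken by lowest dp_rank.
        worker = min(self._workers.values(),
                     key=lambda w: (w.in_flight, w.dp_rank))
        worker.in_flight += 1
        self._owner[request_id] = worker.dp_rank
        return worker.dp_rank

    def finish(self, request_id: str) -> None:
        # Release the slot once the request completes.
        dp_rank = self._owner.pop(request_id)
        self._workers[dp_rank].in_flight -= 1


router = ReplicaRouter([
    LogicalWorker(dp_rank=0, device_ranks=(0, 1, 2, 3)),
    LogicalWorker(dp_rank=1, device_ranks=(4, 5, 6, 7)),
])
placements = [router.assign(r) for r in ("a", "b", "c", "a")]
print(placements)  # [0, 1, 0, 0]
```

The router never reads `device_ranks`, which is the point of the criterion: changing the TP size of a replica changes the executor setup but not the routing code.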

Alternatives Considered

Keep adding multi-device behavior directly inside model runners. This is simple initially, but it makes request routing, KV ownership, profiling, and platform placement hard to reuse.

Put all distributed behavior into the platform layer. This is too coarse: platform should manage lifecycle and placement, while model support owns scheduling, KV policy, sampling, and model-specific parallel execution.

Additional Context

Related pypto-serving issues:

Labels: enhancement (New feature or request)