[Bug] hard-form pl.system.syncall deadlocks (AICore 507018) at partial core occupancy — no compile-time check #1935

@lyfne123

Component

Codegen

Description

A kernel using the hard (FFTS) form of pl.system.syncall deadlocks on device with RuntimeError: run_prepared failed with code 507018 (AICore timeout) when the enclosing pl.spmd(N) launch does not fill all physical cores of the barrier's core_type. The hard barrier (SYNC_AIV_ONLY_ALL / mix) waits for every physical core of that type to arrive; the unlaunched cores never reach the barrier, so the FFTS wait never completes.
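The failure mode can be reproduced on the host as an analogy, with no device involved. The sketch below uses Python threads and `threading.Barrier` to stand in for cores and the FFTS barrier; the function name `run` and the timeout values are illustrative assumptions, not part of pypto. A barrier sized for every physical core never releases when only some of the parties arrive, so each waiter ends in a timeout.

```python
import threading

def run(launched: int, parties: int, timeout_s: float = 0.5) -> str:
    """Return "pass" if every launched worker clears the barrier, else "timeout".

    `parties` plays the role of the physical core count the hard barrier
    waits for; `launched` plays the role of N in pl.spmd(N).
    """
    barrier = threading.Barrier(parties)
    results = []
    lock = threading.Lock()

    def worker() -> None:
        try:
            # Blocks until `parties` threads have arrived or the timeout expires.
            barrier.wait(timeout=timeout_s)
            outcome = "pass"
        except threading.BrokenBarrierError:
            # Raised in every waiter once one of them times out.
            outcome = "timeout"
        with lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(launched)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return "pass" if all(r == "pass" for r in results) else "timeout"

if __name__ == "__main__":
    print(run(launched=24, parties=48))  # partial occupancy
    print(run(launched=48, parties=48))  # full occupancy
```

On the host the timeout is harmless; on the device the equivalent timeout is what surfaces as 507018.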

There is no compile-time diagnostic: the mistake only surfaces at runtime as a timeout, and it leaves the device in a draining/reset state (subsequent runs on that device fail with code 13 until it is reset). The compiler already knows both the launch block count (pl.spmd(N)) and the target SoC's physical core count per core type, so the condition is statically checkable.

Steps to Reproduce

Minimal on-device repro (Ascend 910B = 48 physical AIV cores): an SPMD elementwise add with a hard pl.system.syncall(core_type="aiv_only") between the loads and the add, launched at partial occupancy. Mirrors tests/st/runtime/cross_core/test_syncall.py (whose docstring already documents that hard SYNCALL needs full AIV occupancy).

TR = TC = 128
N = 24  # < 48 physical AIV  -> PARTIAL occupancy (set N = 48 for the passing control)

@pl.program
class PartialSyncAll:
    @pl.function(type=pl.FunctionType.InCore)
    def add(self, a: pl.Tensor[[N*TR, TC], pl.FP32], b: pl.Tensor[[N*TR, TC], pl.FP32],
            out: pl.Out[pl.Tensor[[N*TR, TC], pl.FP32]]) -> pl.Tensor[[N*TR, TC], pl.FP32]:
        i = pl.tile.get_block_idx(); o = i * TR
        ta = pl.load(a, [o, 0], [TR, TC]); tb = pl.load(b, [o, 0], [TR, TC])
        pl.system.syncall(core_type="aiv_only")          # HARD barrier
        out = pl.store(pl.add(ta, tb), [o, 0], out); return out

    @pl.function(type=pl.FunctionType.Orchestration)
    def orchestrator(self, a, b, out):
        with pl.spmd(N):                                  # N=24 < 48 -> partial
            out = self.add(a, b, out)
        return out

Run on a2a3. pl.spmd(24) (partial) → 507018. Changing only pl.spmd(24) to pl.spmd(48) (full occupancy) → PASS. Occupancy is the sole variable.

Expected Behavior

The compiler rejects a hard-mode syncall whose enclosing pl.spmd launch does not fill all physical cores of the barrier's core_type, with a clear compile-time error, e.g.:

hard pl.system.syncall(core_type="aiv_only") requires the spmd launch to fill all 48 AIV cores, but pl.spmd(24) launches 24 blocks. Use mode="soft" (GM-polling) for partial occupancy.

At minimum, a documented static diagnostic instead of a silent runtime 507018 + device reset.
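A minimal sketch of what such a check could look like, written as standalone Python rather than as an actual pypto compiler pass. Everything here is hypothetical: the `PHYSICAL_CORES` table, the `SyncAllOccupancyError` class, and the `check_hard_syncall` signature are illustrative names, and only the 48-core AIV count for Ascend 910B comes from this report. It also assumes that omitting `mode` selects the hard form, as in the repro above.

```python
# Hypothetical per-SoC table of physical core counts, keyed by (soc, core_type).
# Only the Ascend 910B AIV entry is taken from this issue.
PHYSICAL_CORES = {("Ascend910B", "aiv_only"): 48}

class SyncAllOccupancyError(ValueError):
    """Raised when a hard syncall is used under a partial-occupancy launch."""

def check_hard_syncall(soc: str, core_type: str, spmd_blocks: int,
                       mode: str = "hard") -> None:
    """Reject a hard syncall whose enclosing launch leaves cores unlaunched."""
    if mode != "hard":
        return  # the soft form is sized by used_cores, so nothing to check
    required = PHYSICAL_CORES.get((soc, core_type))
    if required is None:
        return  # core count unknown for this target; cannot check statically
    # Only under-occupancy is covered; the issue does not describe launches
    # with more blocks than physical cores.
    if spmd_blocks < required:
        raise SyncAllOccupancyError(
            f'hard pl.system.syncall(core_type="{core_type}") requires the '
            f"spmd launch to fill all {required} physical cores, but "
            f"pl.spmd({spmd_blocks}) launches {spmd_blocks} blocks. "
            'Use mode="soft" (GM-polling) for partial occupancy.'
        )

if __name__ == "__main__":
    check_hard_syncall("Ascend910B", "aiv_only", 48)      # accepted
    try:
        check_hard_syncall("Ascend910B", "aiv_only", 24)  # rejected
    except SyncAllOccupancyError as err:
        print(err)
```

Returning silently for unknown targets keeps the check conservative: it only fires when both inputs are known at compile time.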

Actual Behavior

Compiles silently; on device the run hangs until the FFTS wait times out and the device is force-reset:

[ERROR] sync_run_streams: aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: AICore error 507018: bounded device drain failed (force reset will follow in finalize)
RuntimeError: run_prepared failed with code 507018

The device is left unusable afterwards (later runs fail with code 13 until it recovers).

Environment

| Component | Version |
| --- | --- |
| pypto | d598b41b (branch: main) |
| pypto runtime (submodule) | 02bd0c4f |
| pto-isa | e722679 |
| ptoas | 0.48 |
| CANN | not detected |

Host Platform

Linux (aarch64)

NPU Kind

Ascend 910B

Additional Context

  • The soft form already works at partial occupancy: pl.system.syncall(mode="soft", core_type="aiv_only", gm_workspace=ws, used_cores=N) (see the *Soft* case in tests/st/runtime/cross_core/test_syncall.py). A compile-time check for the hard form would make the hard-vs-soft occupancy contract explicit and catch the footgun early.
  • Related: #1931, "[Bug] dep_gen (DFX) instrumentation breaks full-occupancy pl.system.syncall -> AICore timeout 507018". That issue has a distinct root cause (dep_gen instrumentation perturbing a full-occupancy syncall); this issue is specifically the missing compile-time occupancy guard for the hard form.
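To illustrate why a barrier sized by used_cores tolerates partial occupancy, here is a host-side analogy of a counter-polling barrier. It is not the pypto soft-mode implementation, which this issue describes only as "GM-polling"; the `SoftBarrier` class, the `run_soft` helper, and the use of a lock in place of an atomic add on a GM workspace are all assumptions made for the sketch.

```python
import threading
import time

class SoftBarrier:
    """Single-use counter-polling barrier sized to the participating workers."""

    def __init__(self, used_cores: int) -> None:
        self.used_cores = used_cores
        self._arrived = 0               # stands in for a counter in the GM workspace
        self._lock = threading.Lock()   # stands in for an atomic add

    def wait(self, timeout_s: float = 2.0) -> bool:
        """Announce arrival, then poll until all participants have arrived."""
        with self._lock:
            self._arrived += 1
        deadline = time.monotonic() + timeout_s
        while True:
            with self._lock:
                if self._arrived >= self.used_cores:
                    return True
            if time.monotonic() > deadline:
                return False
            time.sleep(0.001)

def run_soft(launched: int, used_cores: int, timeout_s: float = 2.0) -> bool:
    """Return True if every launched worker clears the barrier."""
    barrier = SoftBarrier(used_cores)
    results = []
    lock = threading.Lock()

    def worker() -> None:
        ok = barrier.wait(timeout_s)
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(launched)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(results) == launched and all(results)

if __name__ == "__main__":
    print(run_soft(launched=24, used_cores=24))  # sized to the launch: releases
```

The barrier releases because its target is the number of launched participants, independent of how many physical cores exist. Passing a used_cores value larger than the launch reproduces the same stall as the hard form.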
