[Bug] Flaky ST: test_dynamic_paged_attention[256-64-128-64-8192-32768] intermittently fails on a2a3 with AICore error 507018 #1840

@lyfne123

Component

Other (please specify in description) — system test (tests/st/runtime/framework_and_models/) failing at on-device execution of the generated paged-attention kernel on the a2a3 backend/runtime.

Description

The system test tests/st/runtime/framework_and_models/test_dynamic_paged_attention.py::TestDynamicPagedAttentionKernels::test_dynamic_paged_attention[256-64-128-64-8192-32768] (test name dynamic_paged_attention_256bat_64h_128d_64bs; params batch=256, num_heads=64, head_dim=128, block_size=64, context_len=8192, max_model_len=32768) intermittently fails on a2a3 with AICore error code 507018.

When it fails, the device is marked unrecoverable and is force-reset. The failure is non-deterministic: in local reproduction, 1 of 3 repetitions failed under --count=3 (a small sample, so the ~1/3 hit rate is only a rough estimate). As a result, it occasionally turns the whole system-tests CI job red without any related code change.
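A back-of-the-envelope calculation can help when choosing how many repetitions to run. The sketch below assumes each repetition fails independently with probability p. That independence is not established in this report (the observed failure came on the 3rd repetition, so accumulated device state may matter), and p = 1/3 is only the rough local estimate. The function names are illustrative, not part of the test suite.

```python
import math


def prob_at_least_one_failure(p: float, runs: int) -> float:
    """Chance of seeing one or more failures in `runs` independent repetitions."""
    return 1.0 - (1.0 - p) ** runs


def runs_needed(p: float, confidence: float) -> int:
    """Smallest repetition count whose chance of at least one failure reaches `confidence`."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under these assumptions, three repetitions surface the failure about 70% of the time (19/27), and about 8 repetitions would be needed for a 95% chance of hitting it at least once.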

This is not caused by a specific diff — it was hit while triaging PR #1834 (a Python-only runtime-default rename, zero .cpp/.h changes), and reproduces independently of that PR. The generated kernel is identical to main.

Steps to Reproduce

Run the single parametrization on an a2a3 device. A single run often passes; repeat it to surface the flakiness:

export PYTHONPATH=/path/to/pypto/python:$PYTHONPATH
python -m pytest \
  "tests/st/runtime/framework_and_models/test_dynamic_paged_attention.py::TestDynamicPagedAttentionKernels" \
  -k "256-64-128-64-8192-32768" \
  -v --device=<dev> --platform=a2a3 \
  --pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24 \
  --count=3

Observed locally (2026-06-23):

  • Single isolated run: PASSED (~72s).
  • --count=3: 1 failed / 2 passed (failed on the 3rd repetition with the same 507018 signature).
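To collect a larger sample, the reproduction can be scripted. The following is a minimal sketch, not the project's tooling: it rebuilds the command above without --count and launches each repetition as a separate pytest process, which is a different mode from in-process repetition and may or may not reproduce the fault at the same rate. The helper names (build_cmd, run_repeated, summarize) are hypothetical, and the device value must be supplied by the caller.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Optional

TEST_ID = (
    "tests/st/runtime/framework_and_models/test_dynamic_paged_attention.py"
    "::TestDynamicPagedAttentionKernels"
)
PARAM_FILTER = "256-64-128-64-8192-32768"
PTO_ISA_COMMIT = "016396b57e2c17093f1194e6acd89bb112b0ab24"


def build_cmd(device: str) -> List[str]:
    """Mirror the reproduction command from this report, minus --count."""
    return [
        sys.executable, "-m", "pytest", TEST_ID,
        "-k", PARAM_FILTER,
        "-v", f"--device={device}", "--platform=a2a3",
        f"--pto-isa-commit={PTO_ISA_COMMIT}",
    ]


def run_repeated(
    cmd: List[str],
    repetitions: int,
    run_once: Optional[Callable[[List[str]], int]] = None,
) -> List[int]:
    """Run `cmd` `repetitions` times and return each process exit code."""
    if run_once is None:
        # Default: launch a fresh process per repetition.
        run_once = lambda c: subprocess.run(c).returncode
    return [run_once(cmd) for _ in range(repetitions)]


def summarize(return_codes: List[int]) -> Dict[str, float]:
    """Count non-zero exit codes and compute the observed failure rate."""
    runs = len(return_codes)
    failures = sum(1 for rc in return_codes if rc != 0)
    rate = failures / runs if runs else 0.0
    return {"runs": runs, "failures": failures, "failure_rate": rate}
```

A call such as summarize(run_repeated(build_cmd(dev), 10)), with dev set to the target device, would give a failure rate over ten runs. PYTHONPATH still needs to be exported as shown above.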

Expected Behavior

The test passes deterministically on every run; the device is not poisoned.

Actual Behavior

One repetition out of three fails:

RuntimeError: run_prepared failed with code 507018
  File ".../runtime/device_runner.py", line 638, in execute_on_device
    return worker.run(cid, orch_args, cfg)
  File ".../simpler/worker.py", line 2587, in run
  File ".../simpler/task_interface.py", line 544, in _run_slot
    return self._impl.run(int(callable_id), args, config)

[ERROR] sync_run_streams: [device_runner_base.cpp:881]
        aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: [device_runner.cpp:481]
        Device unrecoverable after AICore error 507018:
        aclrtSynchronizeDeviceWithTimeout failed. Marking DeviceRunner unusable;
        a soft in-process reset does not clear the poison.
[WARN]  force_reset_device: [device_runner.cpp:578]
        aclrtResetDeviceForce cleared the poisoned card
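For CI triage, it may help to tell this failure signature apart from ordinary test failures. The sketch below scans captured log text for the three markers seen above: the error code, the "Device unrecoverable" message, and the aclrtResetDeviceForce reset. The matching strings are taken from this one log excerpt and may not cover every variant; the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Set

# Patterns copied from the log excerpt in this report.
CODE_RE = re.compile(r"(?:failed with code|failed:|AICore error) (\d+)")


@dataclass
class FailureSignature:
    error_codes: Set[str] = field(default_factory=set)
    device_unrecoverable: bool = False
    force_reset: bool = False

    @property
    def matches_issue(self) -> bool:
        """True when all three markers from this report are present."""
        return (
            "507018" in self.error_codes
            and self.device_unrecoverable
            and self.force_reset
        )


def classify(log_text: str) -> FailureSignature:
    """Extract the error codes and device-state markers from raw log text."""
    return FailureSignature(
        error_codes=set(CODE_RE.findall(log_text)),
        device_unrecoverable="Device unrecoverable" in log_text,
        force_reset="aclrtResetDeviceForce" in log_text,
    )
```

A CI wrapper could use classify(log).matches_issue to tag a red job as a likely instance of this flake rather than a regression. That is only a heuristic, since the same code can have a different root cause.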

Git Commit ID

Reproduced on 1e18b0de (PR #1834 head, based on origin/main); not branch-specific — the kernel matches main.

NPU Kind

Other (please specify in description) — a2a3 (--platform=a2a3; npu-smi info reports the chip as Ascend910).

Host Platform

Linux (aarch64)

Additional Context

Related: #1388 (closed — flaky test_paged_attention_spmd_ptoas with the same 507018 device-unrecoverable signature), #1789 (open — same 507018 code but a different root cause). The shared 507018 + force-reset signature across multiple paged-attention ST variants suggests a common intermittent device/AICore fault rather than a per-test bug.

Labels

bug (Something isn't working)
