Component
Other (please specify in description) — system test (tests/st/runtime/framework_and_models/) failing at on-device execution of the generated paged-attention kernel on the a2a3 backend/runtime.
Description
The system test tests/st/runtime/framework_and_models/test_dynamic_paged_attention.py::TestDynamicPagedAttentionKernels::test_dynamic_paged_attention[256-64-128-64-8192-32768] (test name dynamic_paged_attention_256bat_64h_128d_64bs; params batch=256, num_heads=64, head_dim=128, block_size=64, context_len=8192, max_model_len=32768) intermittently fails on a2a3 with AICore error code 507018.
When it fails, the device is marked unrecoverable and is force-reset. The failure is non-deterministic (about 1 failure in 3 repetitions in local reproduction), so it occasionally turns the whole system-tests CI job red without any related code change.
This is not caused by a specific diff — it was hit while triaging PR #1834 (a Python-only runtime-default rename, zero .cpp/.h changes), and reproduces independently of that PR. The generated kernel is identical to main.
Steps to Reproduce
Run the single parametrization on an a2a3 device. A single run often passes; repeat it to surface the flakiness:
export PYTHONPATH=/path/to/pypto/python:$PYTHONPATH
python -m pytest \
"tests/st/runtime/framework_and_models/test_dynamic_paged_attention.py::TestDynamicPagedAttentionKernels" \
-k "256-64-128-64-8192-32768" \
-v --device=<dev> --platform=a2a3 \
--pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24 \
--count=3
Observed locally (2026-06-23):
- Single isolated run: PASSED (~72s).
- --count=3: 1 failed / 2 passed (failed on the 3rd repetition with the same 507018 signature).
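Three repetitions give only a rough failure rate. A minimal sketch for getting a larger sample is to run the same parametrization repeatedly, each time in a fresh process, and tally the non-zero exits. The helper names `repro_command` and `estimate_failure_rate` are hypothetical and not part of the repository; the pytest flags are copied from the command above, without `--count`. Whether the fault behaves differently across separate processes than across in-process repetitions is not established here.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Sequence

TEST_ID = (
    "tests/st/runtime/framework_and_models/"
    "test_dynamic_paged_attention.py::TestDynamicPagedAttentionKernels"
)


def repro_command(device: str) -> list[str]:
    """Build the single-parametrization pytest command (hypothetical helper)."""
    return [
        sys.executable, "-m", "pytest", TEST_ID,
        "-k", "256-64-128-64-8192-32768",
        "-v", f"--device={device}", "--platform=a2a3",
        "--pto-isa-commit=016396b57e2c17093f1194e6acd89bb112b0ab24",
    ]


def estimate_failure_rate(cmd: Sequence[str], runs: int) -> tuple[int, float]:
    """Run cmd `runs` times in fresh processes; return (failures, failure fraction)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for i in range(runs):
        # Any non-zero exit status counts as a failed repetition.
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            failures += 1
        print(f"run {i + 1}/{runs}: exit {result.returncode}")
    return failures, failures / runs
```

For example, `estimate_failure_rate(repro_command("<dev>"), 20)` would report how many of 20 isolated runs failed, with `<dev>` replaced by the device in use.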
Expected Behavior
The test passes deterministically on every run; the device is not poisoned.
Actual Behavior
One repetition out of three fails:
RuntimeError: run_prepared failed with code 507018
File ".../runtime/device_runner.py", line 638, in execute_on_device
return worker.run(cid, orch_args, cfg)
File ".../simpler/worker.py", line 2587, in run
File ".../simpler/task_interface.py", line 544, in _run_slot
return self._impl.run(int(callable_id), args, config)
[ERROR] sync_run_streams: [device_runner_base.cpp:881]
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
[ERROR] recover_device_or_mark_unusable: [device_runner.cpp:481]
Device unrecoverable after AICore error 507018:
aclrtSynchronizeDeviceWithTimeout failed. Marking DeviceRunner unusable;
a soft in-process reset does not clear the poison.
[WARN] force_reset_device: [device_runner.cpp:578]
aclrtResetDeviceForce cleared the poisoned card
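When triaging CI logs, it can help to check mechanically whether a failure carries this signature (error code, device marked unrecoverable, forced reset). The sketch below matches only the message shapes that appear in the log above; `summarize_failure` is a hypothetical helper, and other runtime versions may word these messages differently.

```python
import re

# Patterns taken from the three message shapes seen in the log excerpt above.
CODE_RE = re.compile(r"(?:failed with code|failed:|AICore error) (\d+)")


def summarize_failure(log_text: str) -> dict:
    """Extract error codes and device-state markers from a captured log."""
    codes = sorted({int(code) for code in CODE_RE.findall(log_text)})
    return {
        "error_codes": codes,
        "device_unrecoverable": "Device unrecoverable" in log_text,
        "force_reset": "aclrtResetDeviceForce" in log_text,
    }
```

A log matching this issue would yield `error_codes` equal to `[507018]` with both flags set, which makes it easier to group occurrences with the related issues.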
Git Commit ID
Reproduced on 1e18b0de (PR #1834 head, based on origin/main); not branch-specific — the kernel matches main.
NPU Kind
Other (please specify in description) — a2a3 (--platform=a2a3; npu-smi info reports the chip as Ascend910).
Host Platform
Linux (aarch64)
Additional Context
Related: #1388 (closed — flaky test_paged_attention_spmd_ptoas with the same 507018 device-unrecoverable signature), #1789 (open — same 507018 code but a different root cause). The shared 507018 + force-reset signature across multiple paged-attention ST variants suggests a common intermittent device/AICore fault rather than a per-test bug.