Add PTODSL A5 DSL ST coverage #886
Code Review
This pull request introduces native tensor_view and partition_tensor_view folding support in the FoldTileBufIntrinsics pass, updates ExpandTileOp to include view shape and strides in the specialization key, and adds a pto_level parameter to @pto.jit to forward build-level overrides to ptoas. Additionally, VPTOSplitCVModule is updated to normalize sections in-place for pre-annotated modules. Feedback on the changes highlights a concurrency violation in FoldTileBufIntrinsics where a FuncOp pass queries the parent module's symbol table, a limitation in traceViewChain that fails on nested partitions, and an inefficient cleanup loop that should be optimized using a worklist-based dead code elimination approach.
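The worklist-based cleanup suggested above can be illustrated with a toy model. This is a minimal Python sketch, not the pass's C++ code: the `Op` class, the `ERASABLE_KINDS` set, and kind names such as `"cast"` are hypothetical stand-ins for MLIR operations. It erases a use-empty op and then revisits only that op's operands, rather than rescanning the whole function until nothing changes.

```python
from collections import deque

# Hypothetical op kinds that the cleanup is allowed to erase when unused.
ERASABLE_KINDS = {"cast", "view", "alloc"}


class Op:
    """Toy stand-in for an IR operation: a kind, operands, and users."""

    def __init__(self, name, kind, operands=()):
        self.name = name
        self.kind = kind
        self.operands = list(operands)
        self.users = set()
        for operand in self.operands:
            operand.users.add(self)


def erase_dead_ops(ops):
    """Erase use-empty erasable ops; return the erased names in order."""
    worklist = deque(ops)
    erased = []
    erased_set = set()
    while worklist:
        op = worklist.pop()
        if op in erased_set or op.kind not in ERASABLE_KINDS or op.users:
            continue
        erased.append(op.name)
        erased_set.add(op)
        for operand in op.operands:
            operand.users.discard(op)
            # The operand may have just become dead, so revisit it.
            worklist.append(operand)
    return erased


alloc = Op("alloc", "alloc")
view = Op("view", "view", [alloc])
cast = Op("cast", "cast", [view])
live_alloc = Op("live_alloc", "alloc")
store = Op("store", "store", [live_alloc])  # not erasable, keeps live_alloc alive

print(erase_dead_ops([alloc, view, cast, live_alloc, store]))
```

In this model the dead chain is erased from its tail back to the allocation, while the allocation that still feeds the store is left alone.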
Codex Review: this comment is updated automatically by the review bot.
Summary: Review failed at stage Findings. No structured findings were generated because the review process failed early.
9f6aa25 to 93ec308
L0C_ADDR = 0

@pto.cube
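- test/dsl-st/cube_matrix_pipeline.py: the original cube case had drifted from the current PTODSL surface and failed in the compile path on CI. It is now written in the cube pipeline form that current mainline supports stably, using explicit L1/L0 data movement, matmul, and writeback to cover the same class of capability.
- test/dsl-st/gemv_mx_pipeline.py: an intermediate revision hand-inserted _pto.TGetScaleAddrOp(...) to supply the MX scale binding, but that made the CI build-ptodsl path send pto.tget_scale_addr into the wrong ExpandTileOp template instantiation and fail. This reverts to the pure PTODSL pto.tile.gemv_mx* form and avoids hand-written raw IR.
- test/dsl-st/predicate_pack.py: the original version put psts/ppack/punpack inside a @pto.simd helper and took addresses through the tile handle in that helper, which triggered unstable helper ABI/lowering problems on CI. Predicate materialization is moved back into the top-level vector body, and vlds/psts now use an explicit UB ptr.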
def compile(self, **constexpr_bindings):
    compiled = self._compiler.compile(**constexpr_bindings)
    _attach_flat_vpto_attrs(compiled.build(), self._compiler._module_spec)
    return compiled
Previously, to work around the nested container problem, I wrote _flat_jit and called the internal KernelCompiler/KernelModuleSpec directly.
These have now all been changed back to @pto.jit.
for (pto::AllocTileOp alloc : llvm::reverse(deadAllocs))
  alloc.erase();
return !deadAllocs.empty();
Please split this change and the lit case below into a separate commit, and keep the changes in each commit as clean as possible.
mkdir -p "${WORK_SPACE}"
WORK_SPACE="$(cd "${WORK_SPACE}" && pwd)"

has_torch_npu_packages() {
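I don't think we should write fallback logic for these environment problems in every script. If the CI environment has a problem, please ask Wang Miao (王淼) to fix it. Scripts should assume the environment is fully ready; the environment can be set up once at the CI entry point.
PTODSL source-backed cases now use only the PTO_PYTHON_BIN / PYTHON_BIN / PTO_DSL_ST_PYTHON_BIN that CI or the caller passes in explicitly. The CI entry point is responsible for selecting and exporting a usable Python environment.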
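The "explicit interpreter only" rule described in this reply could look roughly like the following. This is a hedged Python illustration rather than the repository's shell code: the three variable names come from the reply, but the lookup order and the `resolve_python_bin` helper are assumptions. The point is that the script fails loudly instead of probing or installing anything itself.

```python
import os

# Variable names come from the reply above; the lookup order is an assumption.
PYTHON_ENV_VARS = ("PTO_PYTHON_BIN", "PYTHON_BIN", "PTO_DSL_ST_PYTHON_BIN")


def resolve_python_bin(env=None):
    """Return the explicitly provided interpreter path, or raise if none is set."""
    env = os.environ if env is None else env
    for name in PYTHON_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "no Python interpreter configured; set one of: " + ", ".join(PYTHON_ENV_VARS)
    )


print(resolve_python_bin({"PYTHON_BIN": "/usr/bin/python3"}))
```

With no fallback branch, a misconfigured CI job surfaces at the entry point rather than inside an individual test script.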
    isBackendPartitionedContainer(op) &&
    children.front()->hasAttr(mlir::pto::FunctionKernelKindAttr::name)) {
  FailureOr<OwningOpRef<ModuleOp>> jobModuleOr =
      buildBackendChildCompileUnit(op, children.front());
What is this logic doing? --mlir-print-ir-after can dump the output of any pass, so there should be no need for a dedicated debug entry point.
This change addresses the following case.
PTODSL @pto.jit generates a single-child backend container:
module attributes {pto.target_arch = "a5"} {
  module attributes {pto.backend = "vpto", pto.kernel_kind = #pto.kernel_kind, ...} {
    func.func @tadd_f32_16x64(...) attributes {pto.entry} {
      ...
      pto.tload ...
      pto.tstore ...
    }
  }
}
PTODSL generates a backend-partitioned container made of an outer module plus one child module. For a single-child backend-partitioned container, the ptoas driver did not treat the child module as the actual VPTO compile unit during object compilation, so compilation failed at ExpandTileOp.
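The selection problem described here can be sketched with a toy model. This is only an illustration in Python, not the driver's C++ logic: modules are modeled as plain dicts, `select_compile_unit` is a hypothetical helper, and carrying the outer attributes down to the child is an assumption about what a compile unit would need.

```python
def select_compile_unit(module):
    """Pick the module to hand to the backend pipeline.

    `module` is a hypothetical dict model: {"attrs": {...}, "children": [...]}.
    """
    children = module.get("children", [])
    is_single_child_container = (
        len(children) == 1 and "pto.backend" in children[0].get("attrs", {})
    )
    if not is_single_child_container:
        return module
    child = dict(children[0])
    # Assumption: outer attributes (e.g. pto.target_arch) are carried down,
    # and the child's own attributes win on conflict.
    child["attrs"] = {**module.get("attrs", {}), **child.get("attrs", {})}
    return child


outer = {
    "attrs": {"pto.target_arch": "a5"},
    "children": [{"attrs": {"pto.backend": "vpto"}, "children": []}],
}
print(select_compile_unit(outer)["attrs"])
```

A container with zero or several children is returned unchanged in this model, which keeps the special case limited to the single-child shape shown above.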
93ec308 to 99f140f
4b25c38 to 26da009
Abstract
This PR adds the first PTODSL-authored A5 DSL ST coverage and updates the PTODSL simulator CI path so the new cases are actually built and run.
The branch was rebuilt after dropping the earlier broad backend workaround. The remaining backend change is intentionally narrow:
FoldTileBufIntrinsics now performs fixpoint cleanup of dead view chains exposed after tile intrinsic folding. It deletes only use-empty bridge casts, memref view ops, pto.make_tensor_view/pto.partition_view, and dead tile allocations; it does not rerun full PTOViewToMemref, does not broaden ExpandTileOp, and does not change live view lowering semantics.
Problem scenarios covered:
- tadd validates a basic tload + tadd + tstore path outside the old tilelang_st harness.
- tload_store validates GM view construction, tload, tstore, and layout variants.
- tcolexpand and tcolsum cover non-trivial tile shapes, valid rows/cols, and tile-op expansion/runtime behavior.
- tmatmul validates a cube tile matmul path, while the existing cube_matrix_pipeline.py and gemv_mx_pipeline.py remain part of the simulator suite.
- mode="explicit" kernels need to compile through PTOAS level3 and should not implicitly enable sync insertion.
- .so artifacts.
- @pto.simd/@pto.cube should not create redundant section wrappers; explicit kind mismatches should fail early with a clear diagnostic.
- make_tensor_view/partition_view chains can otherwise leave high-level or memref view ops that later VPTO emission validation rejects.
Implementation changes:
- test/dsl-st/npu_a5 cases for tadd, tload_store, tcolexpand, tcolsum, and tmatmul.
- predicate_pack.py and vmulscvt.py to avoid level3 live memref subviews.
- kernel_kind was explicitly authored in @pto.jit while preserving the historical default effective kind of vector when omitted.
- mode="explicit" to ptoas --pto-level=level3; keep explicit mode from implicitly enabling insert-sync.
- FoldTileBufIntrinsics and a focused VPTO lit regression.
- tools/ptoas/driver.cpp for non-debug output paths.
- torch/torch_npu runtime instead of installing them each run, isolate PTODSL build artifacts, and ensure test/dsl-st/npu_a5 is covered.
Validation
Validated on the 144 simulator environment under /home/zhoujiaming/ptoas-sim-ci/pr886-cleanup using the LLVM21 VPTO build and CANN simulator.
Results:
- llvm-lit: 764/764 passed.
- ptodsl/tests/test_jit_compile.py: passed.
- scripts/sim_dsl.sh --soc-version Ascend950PR_9599 test/dsl-st: all cases passed, including cube_matrix_pipeline, gemv_mx_pipeline, predicate_pack, simt_gm_memory_core, vmulscvt, and the new npu_a5 directory coverage.
Local checks:
Also checked that the final diff no longer touches the old broad-workaround files such as ExpandTileOp.cpp, PTOInstantiateAndInlineOpLib.cpp, Passes.td, tools/ptoas/ptoas.cpp, VPTOOps.td, VPTO.cpp, or VPTOPtrNormalize.cpp.
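A check like the last one can be automated over a list of changed paths. The following is a small Python sketch under stated assumptions: the file names are the ones listed above, while the `touched_workaround_files` helper and the example paths are hypothetical. The path list could come from something like `git diff --name-only` against the base branch.

```python
# File names are taken from the check described above.
WORKAROUND_FILES = (
    "ExpandTileOp.cpp",
    "PTOInstantiateAndInlineOpLib.cpp",
    "Passes.td",
    "tools/ptoas/ptoas.cpp",
    "VPTOOps.td",
    "VPTO.cpp",
    "VPTOPtrNormalize.cpp",
)


def touched_workaround_files(changed_paths):
    """Return the changed paths that match one of the old workaround files."""
    hits = []
    for path in changed_paths:
        normalized = path.replace("\\", "/")
        for name in WORKAROUND_FILES:
            # Match the whole path or a trailing path component, not a substring.
            if normalized == name or normalized.endswith("/" + name):
                hits.append(path)
                break
    return hits


# Hypothetical changed-file list, for illustration only.
changed = ["some/dir/FoldTileBufIntrinsics.cpp", "test/dsl-st/npu_a5/tadd.py"]
print(touched_workaround_files(changed))
```

Matching on a trailing path component avoids false positives such as a file named MyVPTO.cpp being reported for VPTO.cpp.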