Support split from logical domain to allocation domain to represent padding #5184
base: main
Conversation
tests/cpp/test_layout_op.cpp
Outdated
out->split(1, 16);
out->setAllocationDomain(out->getLoopDomain(), true);
// restore loop domain
out->merge(1);
This doesn't restore. Is this necessary?
Touché. It unsplits the loop domain so that it has the same size as the logical domain.
You are right that the extent is no longer the same, so it's not a restoration.
Schedulers expect an un-scheduled fusion. Without this merge, I'm hitting this assert:
Fuser/csrc/scheduler/pointwise.cpp
Line 357 in db9721d
NVF_ERROR(broadcast_bit_multiples.size() == ref_loop.size());
Hmm, I'm not sure that's a good enough WAR, though this is just a test.
I thought the schedulers can work with some scheduled loop domains (for DID parallelization), no?
Fuser/csrc/scheduler/pointwise.cpp
Lines 231 to 233 in 12121b9
// We always cacheBefore output at the beginning of the scheduling. And after
// cacheBefore, the reference tensor will have all reduction IDs removed.
ref_loop = TensorDomain::noDevices(TensorDomain::noReductions(ref_loop));
DID-related IDs are just ignored by the scheduler, so that is too specific to multi-device.
I'm not a fan of this either. Let me see if I can skip messing with the loop domain and apply the transformation on the allocation domain directly.
I suppose you can just modify the allocation domain with AbstractTensor. I remember there are some tests.
I can also directly use IterDomain::split for that.
Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay doesn't pick it up. Feels similar to the allocation domain replay that rfactor was missing. FYI @Priya2698
#0 nvfuser::nvfCheckFail (func=0xaaaaac218080 "validateDomainEquivalence",
file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
msg=" INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. \nExpected !compare_result.dom0_has_u"...) at /opt/pytorch/nvfuser/csrc/exceptions.cpp:267
#1 0x0000aaaaab1bbe68 in nvfuser::nvfErrorFail (func=0xaaaaac218080 "validateDomainEquivalence",
file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
condMsg=0xaaaaac217fd8 " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ",
userMsg="Expected !compare_result.dom0_has_unreachable_ids . dom0 has unreachable IDs. dom0: iS10{i0}, iS11{i2}. dom1: iS10{i0}") at /opt/pytorch/nvfuser/csrc/exceptions.cpp:277
#2 0x0000aaaaab60a3e8 in nvfuser::ir_utils::validateDomainEquivalence (
dom0=std::vector of length 2, capacity 2 = {...}, dom1=std::vector of length 1, capacity 3 = {...},
additional_ids=std::vector of length 0, capacity 0) at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162
#3 0x0000aaaaab4aac30 in nvfuser::TensorDomain::setAllocationDomain (this=0xaaaab20918b0,
new_allocation_domain=std::vector of length 1, capacity 3 = {...},
new_contiguity=std::vector of length 1, capacity 3 = {...})
at /opt/pytorch/nvfuser/csrc/ir/nodes.cpp:4055
#4 0x0000aaaaabc7b368 in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
producer=0xaaaab2091200, producer_pos=2, logical_map=..., opt=...)
at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:917
#5 0x0000aaaaabc7b7fc in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
producer=0xaaaab2091200, compute_at_axis=-1, opt=...)
at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:945
#6 0x0000aaaaabc44ccc in nvfuser::TensorView::cacheBefore (this=0xaaaab2088c00,
op_type=nvfuser::LoadStoreOpType::Set) at /opt/pytorch/nvfuser/csrc/tensor_view.cpp:1160
#7 0x0000aaaaabbdb250 in nvfuser::scheduler_utils::cacheAndForkOutputs (fusion=0xaaaab2084910,
unroll=true) at /opt/pytorch/nvfuser/csrc/scheduler/utils.cpp:1357
#8 0x0000aaaaabb067dc in nvfuser::schedulePointwise (fusion=0xaaaab2084910, pparams=0xaaaab207f880)
at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:822
#9 0x0000aaaaabb0898c in nvfuser::PointWiseScheduler::schedule (this=0xaaaab2083460,
fusion=0xaaaab2084910, params=0xaaaab207f880)
at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:1304
So, what did you decide to do? Nothing seems to have changed?
> I can also directly use IterDomain::split for that.
Of course, but you'd need to maintain the proper ordering of the ID vector yourself.
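For example, a minimal sketch of that direct approach, assuming a 2D output `out` (assembled from the snippets later in this thread; the split factor and contiguity flags come from the test code shown below, the ordering comment is an editorial note):

// IterDomain::split does not touch the loop domain, so the allocation vector
// has to be assembled and ordered by hand: {axis(0), outer, inner}.
auto&& [io, ii] = IterDomain::split(
    out->axis(1), IrBuilder::create<Val>(16L, DataType::Index), /*inner_split=*/true);
out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});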
> I can also directly use IterDomain::split for that. Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay doesn't pick it up. Feels similar to the allocation domain replay that rfactor was missing. FYI @Priya2698
Yes, the rfactor replay for allocation will also complain similarly if the allocation transforms are disjoint from root-to-loop.
replayPasC also uses the loop domain as the target, so if you intend to use IterDomain::split, we will have to update that, among other things.
Yep, switched to selfReplay instead of replayCasP for TensorView::cacheBefore.
  }
};

TEST_F(LayoutOpTest, LogicalAndAllocationSizes) {
What is being tested here?
Without the relaxation in the vectorization analysis, this test would trigger an assert.
So the test just verifies that we do allow an allocation domain split now.
In the follow-up PR, we added more validation to this test to check that the produced tensor matches the logical sizes.
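As a rough sketch of what that follow-up validation looks like (assuming the test's `t0`, `cg_outputs`, and `k` from the failure message further down this thread):

// The output's inner allocation extent is padded up to a multiple of 16, but
// its logical extent is still k, so slicing back to k columns should match t0.
auto produced = cg_outputs[0].as<at::Tensor>();
EXPECT_TRUE(t0.equal(produced.slice(/*dim=*/1, /*start=*/0, /*end=*/k)));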
The changes look good for the multidevice support part. I am not familiar enough with the requirements for LayoutOp, so I will defer to Naoya to approve the PR.
Is there an existing issue or doc detailing the LayoutOp design?
!test

!test

Sorry, I don't have anything on that yet. I'll try to write one up when I have the end-to-end example working, at least in a prototype. Mostly trying to wing it at the moment.

!test
csrc/transform_replay.cpp
Outdated
// Replay loop.
if (self_loop != self->logical()) {
  ReplaySelf replay(self_loop, axis_map);
Just FYI: #4585 reversed this. I expect some tests to break.
Thanks a ton. Let me sweep through failing tests and see if there's anything easy to patch. 🧑💼
!test

!test
fusion.addOutput(out);
// padding output to multiple of 16 on allocation domain
auto&& [io, ii] = IterDomain::split(
    out->axis(1), IrBuilder::create<Val>(16L, DataType::Index), true);
Tagging @naoyam: changed the test to only apply the split on logical -> allocation.
!test

Errr, seeing wrong results coming from … That's pretty scary. Keep digging.

!test

I need to double-check the tensor layout produced in the tests.
// Parallelize type could include device from split.
ido->parallelize(s->outer()->getParallelType());
idi->parallelize(s->inner()->getParallelType());
@wujingyue tagging you to try this guy out.
!test

For my own record.
I'm shelving this PR for now. We have changed the approach taken here for how cacheBefore needs to be handled, which is causing too many test breakages that are tricky to debug at this time. It's not blocking my grouped_mm layout support, so I'll jump back to this when the stack of PRs for that has been cleaned up.
bool only_valid_device_split = true;
for (Expr* expr : exprs | std::views::reverse) {
  validateDeviceSplit(expr);
  if (!isValidDeviceSplit(expr)) {
@protonu You might need this relaxed (things coming from vectorize_helper.cpp and multidevice/...).
I'll start a PR on the side for this.
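A rough sketch of what the relaxation amounts to (based on the diff above; `isValidDeviceSplit` and the loop are from this PR's snippet, while the early-exit handling inside the branch is an assumption):

// Instead of hard-asserting via validateDeviceSplit, record whether every
// expression on the path is a valid device split and let the caller decide.
bool only_valid_device_split = true;
for (Expr* expr : exprs | std::views::reverse) {
  if (!isValidDeviceSplit(expr)) {
    only_valid_device_split = false;
    break;
  }
}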
Thanks. For now, I modified it here:
https://github.com/NVIDIA/Fuser/pull/5322/files
Cherry-picked from #5184: a non-divisible split between the logical and allocation domains could be used to represent padding.
!test

Ha, I'm seeing GB200 with …, which is a lot more promising than the wrong result to poke at. 🦅

😢 This issue showed up during shape inference to figure out the output buffer size for allocation. vvv

!test
// NOTE: this doesn't feel right, we have to mark contiguity on axis(0) as
// `false` to avoid accidental indexing collapsing; this should be figured out
// by indexing from the ceilDiv.
out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
Am I understanding this issue correctly?
a) The tensor actually is contiguous with respect to this allocation domain, which has size M, ceilDiv(K, 16), 16.
b) The tensor winds up not being contiguous with respect to its logical domain, which is of size M, K, because the non-divisible split adds some padding to K.
c) By "indexing collapsing" you mean it does contiguous indexing so that the stride is not part of the index? Is that wrong? It seems like indexing as a contiguous allocation is what we want here.
My question is: what specifically goes wrong when the allocation is set to contiguous?
Yes, you are absolutely correct about a) and b).
Indexing collapsing is wrong here because we are indexing from logical to allocation, which does not access contiguous memory (because of the non-divisible split).
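As a concrete illustration of the mismatch (a made-up example, not code from the PR, assuming logical sizes (M, K) with K = 20 and an inner split by 16):

// Sketch: a "collapsed" index derived from the logical sizes lands in the
// wrong place once the inner allocation extent is padded past K.
#include <cassert>
#include <cstdint>

int main() {
  const int64_t K = 20;
  const int64_t padded_K = ((K + 16 - 1) / 16) * 16;  // ceilDiv(K, 16) * 16 = 32

  const int64_t i2 = 1, i1 = 3;                 // logical coordinates (row, col)
  const int64_t collapsed = i2 * K + i1;        // 23: what contiguous indexing would use
  const int64_t strided = i2 * padded_K + i1;   // 35: correct offset into the padded buffer

  assert(collapsed != strided);  // the two diverge whenever K != padded_K
  return 0;
}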
This is the before and after of the indexing.
With the false contiguity flag:
root@812ada01cb39:/opt/pytorch/nvfuser# NVFUSER_DUMP=cuda_kernel ./bin/test_layout_op --gtest_filter="*LogicalAndAllocationSizes"
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *LogicalAndAllocationSizes
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LayoutOpTest
[ RUN ] LayoutOpTest.LogicalAndAllocationSizes
======= Codegen output for kernel: nvfuser_pointwise_f0_c1_r0_g0 =======
// Codegen generated code
__global__ void nvfuser_pointwise_f0_c1_r0_g0(Tensor<float, 2, 2> T0, Tensor<float, 2, 3> T1) {
nvfuser_index_t i0;
i0 = ((nvfuser_index_t)threadIdx.x) + (128LL * ((nvfuser_index_t)blockIdx.x));
nvfuser_index_t i1;
i1 = i0 % T0.logical_size[1LL];
nvfuser_index_t i2;
i2 = i0 / T0.logical_size[1LL];
if ((i0 < (T0.logical_size[0LL] * T0.logical_size[1LL]))) {
Array<float, 1LL, 1> T2;
T2[0LL] = 0LL;
T2[0LL]
= T0[((T0.alloc_stride[0LL] * i2) + (T0.alloc_stride[1LL] * i1))];
Array<float, 1LL, 1> T3;
T3[0LL]
= T2[0LL];
T1[(i1 + (T1.alloc_stride[0LL] * i2))]
= T3[0LL];
}
}
======================================
[ OK ] LayoutOpTest.LogicalAndAllocationSizes (966 ms)
[----------] 1 test from LayoutOpTest (966 ms total)
With the true contiguity flag:
root@558d9dfeefb8:/opt/pytorch/nvfuser# NVFUSER_DUMP=cuda_kernel ./bin/test_layout_op --gtest_filter="*LogicalAndAllocationSizes"
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *LogicalAndAllocationSizes
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LayoutOpTest
[ RUN ] LayoutOpTest.LogicalAndAllocationSizes
======= Codegen output for kernel: nvfuser_pointwise_f0_c1_r0_g0 =======
// Codegen generated code
__global__ void nvfuser_pointwise_f0_c1_r0_g0(Tensor<float, 2, 2> T0, Tensor<float, 2, 3> T1) {
nvfuser_index_t i0;
i0 = ((nvfuser_index_t)threadIdx.x) + (128 * ((nvfuser_index_t)blockIdx.x));
if ((i0 < (T0.logical_size[0LL] * T0.logical_size[1LL]))) {
Array<float, 1, 1> T2;
T2[0] = 0;
T2[0]
= T0[((T0.alloc_stride[0LL] * (i0 / T0.logical_size[1LL])) + (T0.alloc_stride[1LL] * (i0 % T0.logical_size[1LL])))];
Array<float, 1, 1> T3;
T3[0]
= T2[0];
T1[i0]
= T3[0];
}
}
======================================
/opt/pytorch/nvfuser/tests/cpp/test_layout_op.cpp:128: Failure
Value of: t0.equal(cg_outputs[0].as<at::Tensor>().slice(1, 0, k))
Actual: false
Expected: true
Stacked PRs

Breaking the original PR #5170 into three:
#5186 Fix allocation logic: non-divisible split
#5185 Fix allocation logic: unconnected alloc/logical
#5184 Allow split on logical->allocation <- this one

This PR

Allows splitting an ID on the path logical->allocation to represent padding logic on the allocation. Notably, we no longer require the allocation domain to be on the path between logical->loop, only logical->allocation. Without the extra replay, TensorView::cacheBefore would alter the semantics by changing the allocation domain of outputs.

TODO:
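For what it's worth, a minimal sketch of the buffer size such a split implies (an illustration under assumed sizes, not code from the PR):

// For logical sizes (M, K) and an inner split by `factor`, the allocation
// holds M * ceilDiv(K, factor) * factor elements, i.e. K padded up to the
// next multiple of `factor`.
#include <cstdint>

int64_t padded_numel(int64_t M, int64_t K, int64_t factor) {
  const int64_t outer = (K + factor - 1) / factor;  // ceilDiv(K, factor)
  return M * outer * factor;  // e.g. M = 4, K = 20, factor = 16 -> 128
}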