Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concretization error from communication segment followed by compute segment. #2481

Closed
cowanmeg opened this issue Jun 27, 2024 · 7 comments
Closed
Assignees

Comments

@cowanmeg
Copy link
Collaborator

cowanmeg commented Jun 27, 2024

I am seeing a concretization error where an IterDomain extent is not bound to a constant. This error only occurs when we have a reduction based collective (like allreduce/reduce scatter) followed by computation in another segment.

An example looks like this:
---- local matmul ----
T4_g[ ideviceIdx.x10{i0}, iS11{i3}, iS12{i6}, rS13{i2} ] (DeviceMesh{0 1})
= matmul(T3_l[ ideviceIdx.x7{i0}, iS9{i3}, iS8{i2} ] (DeviceMesh{0 1}),
T1_g[ ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} ] (DeviceMesh{0 1}))

---- allreduce --- (dispatched to ProcessGroup)
T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1} = reduce([ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} DeviceMesh{0 1}))

---- bias add --- Note: rS14{i0} is the Val that triggers the error.
%kernel {
T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1})
= __half2float(T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1}));
T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1})
= broadcast( T2_l[ iS6{i7} ] (DeviceMesh{0 1}) )
T8_l[ bS21{1}, iS22{i7} ] (DeviceMesh{0 1})
= __half2float(T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1}));
T9_g[ iS23{i3}, iS24{i6} ] (DeviceMesh{0 1})
= T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1})

  • T8_l[ bS21{1}, iS22{i7} ] (DeviceMesh{0 1});

Error:
C++ exception with description "ext_opt.hasValue() INTERNAL ASSERT FAILED at "/csrc/dynamic_transform.cpp":279, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Could not evaluate dynamic extent: i0
Exception raised from DynamicTransformConcretizationInfo at csrc/dynamic_transform.cpp:279 (most recent call first):

The allreduce is handled outside of FusionExecutorCache, so I assume this is an issue of piping information about i0's extent to the bias add segment since it cannot be inferred from the segment inputs? If something similar was already implemented for compute segments that take an input with a reduced IterDomain a pointer to it would be greatly appreciated :)

@wujingyue
Copy link
Collaborator

Can you give me a reproducer? I guess it involves some local changes to your test, so I wasn't sure how.

@cowanmeg
Copy link
Collaborator Author

Yes, here is a minimal test to reproduce: 00d93fa

@jacobhinkle
Copy link
Collaborator

I haven't run the repro yet, but it looks like the issue is that we cannot determine the extent i0, which we need in order to tell whether that dimension is a broadcast dimension or not, or whether it is potentially size 0. I wonder how we should infer i0 based on inputs. Note that we will need to be able to do that even if we don't need to concretize that axis, because the symbolic value i0 can be used in other expressions: you could divide a tensor by it for example. If all we have is the "no-devices" aten tensor then we must have some other mechanism like the NCCL world size that we can bind to i0 in an ExpressionEvaluator.

@cowanmeg
Copy link
Collaborator Author

Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end.
Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).

The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!

@jacobhinkle
Copy link
Collaborator

Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end.

👍

Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).

Great! I hadn't noticed that.

The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!

Ah! I think I understand now. Thank you for the explanation! One way to avoid errors like this is for us to strip off that annoying reduction domain for the segment input T5 when we do convertInputLogicalsToRoots (this could probably be renamed eraseRootsInInputs now). This also struck us here: #2317 (comment). I'll give it a try soon.

@cowanmeg
Copy link
Collaborator Author

Yes, that is the exact same error! Thank you!!

jacobhinkle added a commit that referenced this issue Jul 1, 2024
Currently, when we segment at the output of a reduction, the consumer
segment will have an input tensor that has a `Reduction` axis in it.
This can be problematic; see #2481 and #2317. This PR strips reduction
axes from the root and allocation domain in these cases.

---------

Co-authored-by: Jingyue Wu <[email protected]>
@cowanmeg
Copy link
Collaborator Author

cowanmeg commented Jul 1, 2024

Verified the PR worked for this test! Thank you @jacobhinkle!

@cowanmeg cowanmeg closed this as completed Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants