[MIOpen] bugfix: Conv 3d AI kernel tuning; kernel does not exist #1748
projects/miopen/src/solver/conv/conv_hip_implicit_gemm_3d_grouped_fwd_xdlops.cpp
…ar cannot be set dynamically in tests.
Since the revert in PR #1740 it is a bit more difficult to see the changes, but the main changes are visible in the first commit of this PR (when the revert was not yet performed). This PR gets rid of most of the failing test functions. But when I run the full batch of test functions using I have attached a log showing all the failing tests (with and without detailed MIOpen logs).
So it seems to me that these errors are not really related to the code in this PR, but rather to the kernels themselves. (Attached: test_results_miopendetails.log) Can anyone comment on this?
projects/miopen/src/conv/heuristics/ai_conv_3d_kernel_tuning_utils.cpp
Thanks a lot for the review @reidkwja! I have implemented your suggestions. After pulling the latest changes from develop and building from scratch I reran all the tests again using
These tests seem to fail because the kernel that the AI model selects fails, leading to the GPU kernel returning all zeros, e.g.: … or, in 2 out of 16 cases, it returns a non-zero result that is simply not accurate enough: … (see gtest.log attached for the full logs for …)

If my analysis is correct, this will not be trivial to "solve". These errors will not show up when the heuristics are disabled, since the culprit kernels will simply not be selected (unless you do full tuning, but then the genericsearch function can gracefully deal with these errors). However, I imagine we will not want these tests to keep failing in our CI/CD pipeline. We could: …

If these kernels really are broken, I don't think it's the job of the AI heuristics to also predict this (and avoid them), since that would make the AI model a lot more complex (right now it only focusses on predicting which of the available kernel configs is the fastest for a particular conv op). What do others think?

Edit: During a conversation with @vpietila-amd, he suggested that the issue perhaps lies with the …
… Removed blacklisting
After a lot of digging and help from @vpietila-amd, the conclusion was that 14 of the 16 failing tests were the result of the CK workspace allocation not going correctly (and ending up with 0 / nullptr). This caused the CK kernel to fail. I have now increased the threshold slightly (to 4e-3, see comment above). Is this a good solution? I ran the full test suite locally and there are no more failed tests.
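To illustrate the failure mode described above, here is a minimal sketch of a pre-launch workspace check. `WorkspaceIsUsable` and its parameters are hypothetical names made up for this example; they are not MIOpen or CK APIs, and the actual fix in this PR may look different.

```cpp
#include <cstddef>

// Hypothetical helper, not part of MIOpen or CK.
// Returns true when a kernel that needs `required_bytes` of workspace
// can safely be launched with the buffer it was handed.
bool WorkspaceIsUsable(const void* workspace,
                       std::size_t provided_bytes,
                       std::size_t required_bytes)
{
    if(required_bytes == 0)
        return true; // no scratch space needed, so a null pointer is acceptable

    // A null pointer or an undersized buffer is the "0 / nullptr" situation
    // that would make the kernel fail or produce garbage.
    return workspace != nullptr && provided_bytes >= required_bytes;
}
```

A caller could skip the kernel (or report an error) when this returns false, instead of launching it and comparing all-zero output against the reference.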
…var to top of file to avoid namespace conflicts there.
…ound relevant functions
I managed to fix the issues with the Azure pipeline. In that pipeline, compilation with MIOPEN_ENABLE_AI_KERNEL_TUNING was turned off, and the corresponding guards were not placed properly in all 3 of the new versions of the 3d conv solver .cpp files. I don't believe this PR really affects this particular solver, so it looks like this is an issue with that particular CK kernel (and corresponding solver). In other words, I think this PR is ready now :)
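For readers unfamiliar with this kind of build guard, a rough sketch of the pattern follows. Only the macro name MIOPEN_ENABLE_AI_KERNEL_TUNING comes from the discussion; `RunAiHeuristicsSketch`, `ChooseKernelSketch`, and the returned strings are hypothetical placeholders, not the solver code from this PR.

```cpp
#include <string>

// Uncomment to emulate a build with AI kernel tuning enabled.
// #define MIOPEN_ENABLE_AI_KERNEL_TUNING 1

#if MIOPEN_ENABLE_AI_KERNEL_TUNING
// Hypothetical stand-in for the AI heuristic; only compiled when enabled.
static bool RunAiHeuristicsSketch(std::string& kernel_id)
{
    kernel_id = "ai-predicted-config";
    return true;
}
#endif

// Hypothetical selection routine: every use of the guarded function
// must sit inside the same guard, otherwise a build with the feature
// turned off fails to compile.
std::string ChooseKernelSketch()
{
#if MIOPEN_ENABLE_AI_KERNEL_TUNING
    std::string kernel_id;
    if(RunAiHeuristicsSketch(kernel_id))
        return kernel_id;
#endif
    return "default-config"; // fallback used when the feature is compiled out
}
```

With the macro undefined, as in the pipeline described above, the guarded function and its call site both disappear and the fallback path is taken.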
…p stub implementation ("
This reverts commit 5547f19.
[MIOpen] bugfix: Conv 3d AI kernel tuning; kernel does not exist (#1748)

## Motivation

Bugfix to avoid reverting as suggested in #1740.
MI300 test pipeline shows test failures in:
projects/miopen/test/gtest/group_conv3d_bwd.cpp
projects/miopen/test/gtest/group_conv3d_fwd.cpp
projects/miopen/test/gtest/group_conv3d_wrw.cpp

## Technical Details

The tests failed because of errors such as:
`MIOpen(HIP): Error [InitInvokerFactoryNHWC] PerformanceConfig kernel 'DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle<256, 64, 64, 32, Default, 16, 16, 2, 2, 1, 2, 1, 1, 1, 1>' does not exist.`
This was caused by the run_ai_heuristics functions not properly initialising the valid_kernels.
It did not take into account `problem.GetAlphaBetaCase()`, so these errors could occur in the BILINEAR and SCALE cases.

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
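The root cause in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the `AlphaBetaCase` enum, `ValidKernelsFor`, `SelectKernel`, and the shortened kernel names are all hypothetical stand-ins. The idea is that the list of valid kernels depends on the alpha/beta case, so a prediction has to be matched against the list for the current case rather than always against the Default list.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the value returned by problem.GetAlphaBetaCase().
enum class AlphaBetaCase { Default, Scale, Bilinear };

// Hypothetical: each case has its own set of compiled kernel instances.
std::vector<std::string> ValidKernelsFor(AlphaBetaCase c)
{
    switch(c)
    {
    case AlphaBetaCase::Scale: return {"Fwd<Scale, 64>", "Fwd<Scale, 128>"};
    case AlphaBetaCase::Bilinear: return {"Fwd<Bilinear, 64>"};
    default: return {"Fwd<Default, 64>", "Fwd<Default, 128>"};
    }
}

// Walks the model's ranked predictions and returns the first one that
// exists for the given case. An empty string signals "no usable
// prediction", so the caller can fall back to another selection method.
std::string SelectKernel(AlphaBetaCase c, const std::vector<std::string>& ranked)
{
    const std::vector<std::string> valid = ValidKernelsFor(c);
    for(const std::string& name : ranked)
    {
        if(std::find(valid.begin(), valid.end(), name) != valid.end())
            return name;
    }
    return std::string();
}
```

In this sketch, asking for a Bilinear kernel with a ranked list that starts with a Default instance skips that instance instead of returning a name that "does not exist" for the case.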