
Conversation

@amd-bartgips (Contributor) commented Aug 12, 2025

Copied over from a branch in the old repo:
ROCm/MIOpen#3923

Motivation

With this PR MIOpen should be able to use heuristics for 3D convolutions on gfx942:

  • the parameter selection/kernel tuning of the three conv_hip_implicit_gemm_3d_grouped_*_xdlops solvers.
    We have added some ai_* files housing the new models and helper functions, and made some changes to existing files where required.
  • ~~the solver selection (i.e. a "3D tunanet")~~ At the moment the 3D solver selection model is inadequate and the already existing WTI fallback is preferred. Improving the 3D solver selection heuristics will be the focus of a future PR.

Technical Details

  • ai_heuristics.cpp contains all the code that was already there beforehand, plus new code that relies on fdeep includes, since fdeep can only be included in a single file.
  • ai_candidate_selection contains the actual two-tower (aka CandidateSelection) "model" and "metadata" classes that do the computation on floating-point vectors.
  • ai_conv_3d_kernel_tuning_utils contains the machinery one level higher: how to convert kernel configs and the fdb_key input to float vectors, and how to fetch and call the relevant CandidateSelectionModel. This is shared by all three solvers, so it made sense to centralise it in this file (a rough sketch of the idea follows this list).
  • kernels/gfx942.... contains the model and metadata files. Perhaps these should be committed using git lfs, but I have not seen this done for other model files (e.g. Tunanet or KTN), so I have not done so here.
  • solver/conv/...cpp: solver-specific files; they ultimately contain the solver-specific machinery. In this case our three solvers rely heavily on ai_conv_3d_kernel_tuning_utils and, through that, on ai_candidate_selection.
  • gtest files: these should speak for themselves, please have a look. They test all the new machinery.
  • solvers.hpp: a huge header file that contains declarations for all solvers (i.e. for the solver/conv/...cpp files), so this needed to be altered as well.
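
As a rough illustration of the two-tower idea (hypothetical names only, not the actual classes; the real towers are fdeep models loaded from the gfx942 JSON files, not a single linear layer):

```cpp
// Hypothetical sketch of the two-tower (CandidateSelection) idea, not the actual MIOpen code:
// one tower embeds the problem features (derived from the fdb_key), the other embeds each
// candidate kernel config; candidates are ranked by the dot product of the two embeddings.
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

using FloatVec = std::vector<float>;

// Stand-in for a tower: a single linear layer (weights[i] is the i-th output row).
FloatVec RunTower(const FloatVec& features, const std::vector<FloatVec>& weights)
{
    FloatVec out(weights.size(), 0.0f);
    for(std::size_t i = 0; i < weights.size(); ++i)
        out[i] = std::inner_product(features.begin(), features.end(), weights[i].begin(), 0.0f);
    return out;
}

// Returns the index of the highest-scoring candidate kernel config for this problem.
std::size_t SelectCandidate(const FloatVec& problem_features,
                            const std::vector<FloatVec>& candidate_features,
                            const std::vector<FloatVec>& problem_tower,
                            const std::vector<FloatVec>& candidate_tower)
{
    const FloatVec problem_emb = RunTower(problem_features, problem_tower);

    std::size_t best = 0;
    float best_score = std::numeric_limits<float>::lowest();
    for(std::size_t i = 0; i < candidate_features.size(); ++i)
    {
        const FloatVec cand_emb = RunTower(candidate_features[i], candidate_tower);
        const float score =
            std::inner_product(problem_emb.begin(), problem_emb.end(), cand_emb.begin(), 0.0f);
        if(score > best_score)
        {
            best_score = score;
            best       = i;
        }
    }
    return best;
}
```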

Test Plan

3 new gtest .cpp files are added:

  • ~~test/gtest/conv_ai_3d_heuristics.cpp
    This aims to test all new functionality related to the 3D tunanet (model + metadata).~~
    While the tests are still there for future use, they will be skipped since the 3D Tunanet model data is no longer included.
  • test/gtest/conv_ai_3d_kernel_tuning_utils.cpp
    Aims to test all new machinery in ai_conv_3d_kernel_tuning_utils.cpp (preprocessing and handling of inputs to the CandidateSelectionModel for the 3D solvers).
  • test/gtest/conv_ai_candidate_selection_model.cpp
    Tests internal code related to the CandidateSelectionModel and its metadata.

Test Result

The ./bin/test_conv_ai_* tests all succeed without errors when building and running them on a conductor MI300 node.
Besides manually running all the other ./bin/test_* binaries, is there a better way to perform a full test?

Submission Checklist

  • [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

@BradPepersAMD (Contributor) commented:

Because we don't have performance testing to cover this, we need to verify the performance results manually and attach them to this PR so we can view them. I think the goal of this PR is that out-of-the-box performance for 3D convs improves with this change, so can we use MIOpenDriver with a set of 3D conv shapes to test this?

@cderb (Contributor) commented Sep 15, 2025

I don't see any obvious issues with the submission. I will echo the sentiment that it would be good to measure the expected performance uplift, such as the % difference between heuristic-selected and tuned configurations.

@amd-bartgips (Contributor, Author) commented Sep 16, 2025

Regarding performance metrics, I received this list of conv ops from @jfactory07:
https://ontrack-internal.amd.com/secure/attachment/2947810/conv3d_example_miopen_cmd.txt

I then "extended" this list to cover all directions (not just fwd) and all datatypes. Running these MIOpenDriver commands on MI308 led to the summarising figure below:
[figure: histograms of kernel-time ratios (experimental vs. develop) on MI308, split by data type (fp32, fp16, bfp16) and direction]
The kernel timings labelled "develop" were gathered using the then-current develop branch (at around Aug 25th, so containing the hand-crafted heuristics aimed at 16-bit fwd operations introduced by @jfactory07 earlier in this PR).

I gathered these data by forcing all the operations to go through the ConvHipImplicitGemm3DGroup*Xdlops solver (fwd, bwd, or wrw) and turning off the exhaustive tuning, so that we are focussing only on kernel-tuning performance here.
The three plots show histograms of the ratios between kernel times for the two branches, separated out by data type (fp32, fp16, bfp16) and direction. Lower numbers mean that the "experimental" branch, i.e. "our" branch, is faster; a value of 1 (= 10^0) means that both branches are equally fast. Note that most data points land below 1, except for those in the fwd direction for 16-bit operations.
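
To make the arithmetic behind the figure explicit: each data point is the ratio t_experimental / t_develop for one MIOpenDriver command, binned on a log10 scale (so 1 = 10^0 means equal). A minimal, illustrative sketch of that binning (my own illustration, not code from the PR):

```cpp
// Illustration only: bin kernel-time ratios on a log10 scale, as in the histograms above.
// A ratio below 1 (log10 below 0) means the experimental branch was faster for that op.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<int> RatioHistogram(const std::vector<double>& t_experimental,
                                const std::vector<double>& t_develop,
                                double log_min, double log_max, std::size_t n_bins)
{
    std::vector<int> bins(n_bins, 0);
    const double width = (log_max - log_min) / static_cast<double>(n_bins);
    for(std::size_t i = 0; i < t_experimental.size(); ++i)
    {
        const double log_ratio = std::log10(t_experimental[i] / t_develop[i]);
        if(log_ratio < log_min || log_ratio >= log_max)
            continue; // outside the plotted range
        bins[static_cast<std::size_t>((log_ratio - log_min) / width)] += 1;
    }
    return bins;
}
```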

Based on these numbers we decided that:

  • As suggested in the above PR, the hand-crafted heuristics present in the develop branch work well for fp16 and bfp16 in the forward direction. The new heuristics model does not convincingly offer an improvement there, so it has been turned off for those cases. That is, we stick with the status quo, and for the purposes of this PR the red histograms can be ignored.
  • For the other directions and datatypes, the hand-crafted heuristics do not apply (and are not supposed to), so the machine-learning-based heuristics model shows a definite improvement; these cases are the main contribution of this PR.

@amd-bartgips (Contributor, Author) commented Sep 16, 2025

Note that the above figure was made using MI308, not MI300.
I have not (yet) made an exact copy for MI300, but I have partially run a similar analysis:
[figure: partial MI300 analysis; the bottom panel shows convbfp16 kernel-time ratio histograms]
The bottom figure is the most useful and directly comparable to the MI308 plots above. Note that it only contains the original convbfp16 ops provided by Jin (but with the other two directions added).

I hope this is enough to convince you of the improvements in this PR.
If not, let me know if you would like, e.g.:

  • different conv ops to be benchmarked.
  • me to calculate some kind of average improvement (instead of the histograms).
  • the analysis extended to fp16 and fp32 on MI300.

@amd-bartgips (Contributor, Author) commented Sep 19, 2025

The 3D conv fwd solver will now (a rough sketch of this chain follows the list):

  1. check whether there is an override index; if not:
  2. check whether the hard-coded heuristics should be used (bf16, fp16 on gfx942); if they fail, do not raise an error, but log an info (log_i2) message and continue to (3)
  3. run the AI heuristics (even for bf16/fp16, if step 2 failed)
  4. if all of the above fails, use index 0
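
As a rough, hypothetical sketch of that fallback chain (the type and function names below are illustrative stand-ins, not the actual solver API):

```cpp
// Hypothetical sketch of the fallback order described above; all names are illustrative.
#include <iostream>
#include <optional>
#include <string>

struct ProblemDesc            // stand-in for the real problem description
{
    std::string data_type;    // e.g. "bf16", "fp16", "fp32"
    std::string arch;         // e.g. "gfx942"
};

std::optional<int> GetOverrideIndex(const ProblemDesc&)       { return std::nullopt; }
std::optional<int> RunHardCodedHeuristics(const ProblemDesc&) { return std::nullopt; }
std::optional<int> RunAiHeuristics(const ProblemDesc&)        { return 3; }

int ChooseKernelConfigIndex(const ProblemDesc& problem)
{
    // 1. An explicit override index always wins.
    if(auto idx = GetOverrideIndex(problem))
        return *idx;

    // 2. Hard-coded heuristics only for bf16/fp16 on gfx942; on failure, log and fall through.
    if(problem.arch == "gfx942" &&
       (problem.data_type == "bf16" || problem.data_type == "fp16"))
    {
        if(auto idx = RunHardCodedHeuristics(problem))
            return *idx;
        std::cout << "hard-coded heuristics failed, falling back to AI heuristics\n";
    }

    // 3. AI (two-tower) heuristics, also reached for bf16/fp16 when step 2 fails.
    if(auto idx = RunAiHeuristics(problem))
        return *idx;

    // 4. Last resort: default kernel config index.
    return 0;
}
```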

Edit: I reran the kernel-tuning benchmark for the current state of our branch on an MI308 node. For good measure I also pulled and built today's current develop branch.
It shows improvements across the board (lowest for the fwd 16-bit cases). I added average numbers to the legends.
Note that the "experimental" branch now attempts to use the hard-coded heuristics for fp16/bfp16 that the develop branch also uses, but falls back to the AI heuristics if they fail. Hence the peak around 1 for these cases; apparently the hard-coded heuristics do still fail sometimes, leading to a slight improvement even for these cases.
[figure: updated MI308 kernel-time ratio histograms (develop vs. experimental), with average numbers in the legends]

amd-bartgips merged commit 422e872 into develop Sep 22, 2025
8 checks passed
amd-bartgips deleted the silo/feat/3d_conv_heuristics branch September 22, 2025 13:29
assistant-librarian bot pushed a commit to ROCm/MIOpen that referenced this pull request Sep 22, 2025:
[MIOpen] Implement kernel tuning heuristic model for 3D conv ops (two tower model) (#1154)

@rocm-devops

Github action triggered OSDB jenkins job: http://rocm-ci.amd.com/job/compute-rocm-dkms-mathlibs-osdb/286

JonathanLichtnerAMD added a commit that referenced this pull request Sep 23, 2025:

Revert "[MIOpen] Implement kernel tuning heuristic model for 3D conv ops (two tower model) (#1154)" (#1740)

This reverts commit 422e872.

## Motivation

That commit broke MI300 unit tests.

## Test Plan

Reverted this change and verified that the failing build now passes

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
