
Conversation

@amd-bartgips (Contributor) commented Aug 12, 2025

Copied over from a branch in the old repo:
ROCm/MIOpen#3923

Motivation

With this PR MIOpen should be able to use heuristics for 3D convolutions on gfx942:

  • the parameter selection/kernel tuning of the three conv_hip_implicit_gemm_3d_grouped_*_xdlops solvers.
    We have added some ai_* files housing the new models and helper functions, and made some changes to existing files where required.
  • ~~the solver selection (i.e. a "3D tunanet")~~ At the moment the 3D solver selection model is inadequate and the already existing WTI fallback is preferred. Improving the 3D solver selection heuristics will be the focus of a future PR.

Technical Details

  • ai_heuristics.cpp contains all the code that was already there beforehand, plus new code that relies on fdeep includes, since fdeep can only be included in a single file.
  • ai_candidate_selection contains the actual two-tower (aka CandidateSelection) "model" and "metadata" classes that do the computation on floating-point vectors.
  • ai_conv_3d_kernel_tuning_utils contains the machinery one level higher: how to convert kernel configs and the fdb_key input to float vectors, and how to fetch and call the relevant CandidateSelectionModel. This is shared by all three solvers, so it made sense to centralise it in this file (a rough sketch of the idea follows this list).
  • kernels/gfx942.... contains the model and metadata files. Perhaps these should be committed using git lfs, but I have not seen this done for other model files (e.g. Tunanet or KTN), so I have not done so here.
  • solver/conv/...cpp: solver-specific files; they ultimately contain the solver-specific machinery. In this case our three solvers rely heavily on ai_conv_3d_kernel_tuning_utils and, through that, on ai_candidate_selection.
  • gtest files: these should speak for themselves, please have a look. They test all the new machinery.
  • solvers.hpp: a huge header file that contains declarations for all solvers (i.e. for the solver/conv/...cpp files), so this needed to be altered as well.
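
As a rough illustration of the two-tower idea (hypothetical names only, not the actual classes; the real towers are fdeep models loaded from the gfx942 JSON files, not a single linear layer):

```cpp
// Hypothetical sketch of the two-tower (CandidateSelection) idea, not the actual MIOpen code:
// one tower embeds the problem features (derived from the fdb_key), the other embeds each
// candidate kernel config; candidates are ranked by the dot product of the two embeddings.
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

using FloatVec = std::vector<float>;

// Stand-in for a tower: a single linear layer (weights[i] is the i-th output row).
FloatVec RunTower(const FloatVec& features, const std::vector<FloatVec>& weights)
{
    FloatVec out(weights.size(), 0.0f);
    for(std::size_t i = 0; i < weights.size(); ++i)
        out[i] = std::inner_product(features.begin(), features.end(), weights[i].begin(), 0.0f);
    return out;
}

// Returns the index of the highest-scoring candidate kernel config for this problem.
std::size_t SelectCandidate(const FloatVec& problem_features,
                            const std::vector<FloatVec>& candidate_features,
                            const std::vector<FloatVec>& problem_tower,
                            const std::vector<FloatVec>& candidate_tower)
{
    const FloatVec problem_emb = RunTower(problem_features, problem_tower);

    std::size_t best = 0;
    float best_score = std::numeric_limits<float>::lowest();
    for(std::size_t i = 0; i < candidate_features.size(); ++i)
    {
        const FloatVec cand_emb = RunTower(candidate_features[i], candidate_tower);
        const float score =
            std::inner_product(problem_emb.begin(), problem_emb.end(), cand_emb.begin(), 0.0f);
        if(score > best_score)
        {
            best_score = score;
            best       = i;
        }
    }
    return best;
}
```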

Test Plan

3 new gtest .cpp files are added:

  • ~~test/gtest/conv_ai_3d_heuristics.cpp
    This aims to test all new functionality related to the 3D tunanet (model + metadata).~~
    While the tests are still there for future use, they will be skipped since the 3D Tunanet model data is no longer included.
  • test/gtest/conv_ai_3d_kernel_tuning_utils.cpp
    Aims to test all new machinery in ai_conv_3d_kernel_tuning_utils.cpp (preprocessing and handling of inputs to the CandidateSelectionModel for the 3D solvers).
  • test/gtest/conv_ai_candidate_selection_model.cpp
    Tests internal code related to the CandidateSelectionModel and its metadata.

Test Result

The ./bin/test_conv_ai_* tests all succeed without errors when building and running them on a conductor MI300 node.
Besides manually running all the other ./bin/test_* binaries, is there a better way to perform a full test?

Submission Checklist

  • [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

@BradPepersAMD (Contributor) commented:

Because we don't have performance testing to cover this, we need to verify the performance results manually and attach them to this PR so we can view them. I think the goal of this PR is that out-of-the-box performance for 3D convs improves with this change, so can we use MIOpenDriver with a set of 3D conv shapes to test this?

@cderb (Contributor) commented Sep 15, 2025

I don't see any obvious issues with the submission. I will echo the sentiment that it would be good to measure the expected performance uplift, such as the % difference between heuristic-selected and tuned configurations.

@amd-bartgips (Contributor, Author) commented Sep 16, 2025

Regarding performance metrics, I received this list of conv ops from @jfactory07:
https://ontrack-internal.amd.com/secure/attachment/2947810/conv3d_example_miopen_cmd.txt

I then "extended" this list to cover all directions (not just fwd) and all datatypes. Running these MIOpenDriver commands on MI308 led to the summarising figure below:
[figure: histograms of kernel-time ratios (experimental vs. develop) on MI308, split by data type (fp32, fp16, bfp16) and direction]
The kernel timings labelled "develop" were gathered using the then-current develop branch (at around Aug 25th, so containing the hand-crafted heuristics aimed at 16-bit fwd operations introduced by @jfactory07 earlier in this PR).

I gathered these data by forcing all the operations to go through the ConvHipImplicitGemm3DGroup*Xdlops solver (fwd, bwd, or wrw) and turning off the exhaustive tuning, so that we are focussing only on kernel-tuning performance here.
The three plots show histograms of the ratios between kernel times for the two branches, separated out by data type (fp32, fp16, bfp16) and direction. Lower numbers mean that the "experimental" branch, i.e. "our" branch, is faster; a value of 1 (= 10^0) means that both branches are equally fast. Note that most data points land below 1, except for those in the fwd direction for 16-bit operations.
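
To make the arithmetic behind the figure explicit: each data point is the ratio t_experimental / t_develop for one MIOpenDriver command, binned on a log10 scale (so 1 = 10^0 means equal). A minimal, illustrative sketch of that binning (my own illustration, not code from the PR):

```cpp
// Illustration only: bin kernel-time ratios on a log10 scale, as in the histograms above.
// A ratio below 1 (log10 below 0) means the experimental branch was faster for that op.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<int> RatioHistogram(const std::vector<double>& t_experimental,
                                const std::vector<double>& t_develop,
                                double log_min, double log_max, std::size_t n_bins)
{
    std::vector<int> bins(n_bins, 0);
    const double width = (log_max - log_min) / static_cast<double>(n_bins);
    for(std::size_t i = 0; i < t_experimental.size(); ++i)
    {
        const double log_ratio = std::log10(t_experimental[i] / t_develop[i]);
        if(log_ratio < log_min || log_ratio >= log_max)
            continue; // outside the plotted range
        bins[static_cast<std::size_t>((log_ratio - log_min) / width)] += 1;
    }
    return bins;
}
```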

Based on these numbers we decided that:

  • As suggested in the above PR, the hand-crafted heuristics present in the develop branch work well for fp16 and bfp16 in the forward direction. The new heuristics model does not convincingly offer an improvement there, so it has been turned off for those cases. That is, we stick with the status quo, and for the purposes of this PR the red histograms can be ignored.
  • For the other directions and datatypes, the hand-crafted heuristics do not apply (and are not supposed to), so the machine-learning-based heuristics model shows a definite improvement; these cases are the main contribution of this PR.

@amd-bartgips (Contributor, Author) commented Sep 16, 2025

Note that the above figure was made using MI308, not MI300.
I have not (yet) made an exact copy for MI300, but I have partially run a similar analysis:
[figure: partial MI300 analysis; the bottom panel shows convbfp16 kernel-time ratio histograms]
The bottom figure is the most useful and directly comparable to the MI308 plots above. Note that it only contains the original convbfp16 ops provided by Jin (but with the other two directions added).

I hope this is enough to convince you of the improvements in this PR.
If not, let me know if you would like, e.g.:

  • different conv ops to be benchmarked.
  • me to calculate some kind of average improvement (instead of the histograms).
  • the analysis extended to fp16 and fp32 on MI300.

@amd-bartgips (Contributor, Author) commented Sep 19, 2025

The 3D conv fwd solver will now (a rough sketch of this chain follows the list):

  1. check whether there is an override index; if not:
  2. check whether the hard-coded heuristics should be used (bf16, fp16 on gfx942); if they fail, do not raise an error, but log an info (log_i2) message and continue to (3)
  3. run the AI heuristics (even for bf16/fp16, if step 2 failed)
  4. if all of the above fails, use index 0
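
As a rough, hypothetical sketch of that fallback chain (the type and function names below are illustrative stand-ins, not the actual solver API):

```cpp
// Hypothetical sketch of the fallback order described above; all names are illustrative.
#include <iostream>
#include <optional>
#include <string>

struct ProblemDesc            // stand-in for the real problem description
{
    std::string data_type;    // e.g. "bf16", "fp16", "fp32"
    std::string arch;         // e.g. "gfx942"
};

std::optional<int> GetOverrideIndex(const ProblemDesc&)       { return std::nullopt; }
std::optional<int> RunHardCodedHeuristics(const ProblemDesc&) { return std::nullopt; }
std::optional<int> RunAiHeuristics(const ProblemDesc&)        { return 3; }

int ChooseKernelConfigIndex(const ProblemDesc& problem)
{
    // 1. An explicit override index always wins.
    if(auto idx = GetOverrideIndex(problem))
        return *idx;

    // 2. Hard-coded heuristics only for bf16/fp16 on gfx942; on failure, log and fall through.
    if(problem.arch == "gfx942" &&
       (problem.data_type == "bf16" || problem.data_type == "fp16"))
    {
        if(auto idx = RunHardCodedHeuristics(problem))
            return *idx;
        std::cout << "hard-coded heuristics failed, falling back to AI heuristics\n";
    }

    // 3. AI (two-tower) heuristics, also reached for bf16/fp16 when step 2 fails.
    if(auto idx = RunAiHeuristics(problem))
        return *idx;

    // 4. Last resort: default kernel config index.
    return 0;
}
```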

Edit: I reran the kernel-tuning benchmark for the current state of our branch on an MI308 node. For good measure I also pulled and built today's current develop branch.
It shows improvements across the board (lowest for the fwd 16-bit cases). I added average numbers to the legends.
Note that the "experimental" branch now attempts to use the hard-coded heuristics for fp16/bfp16 that the develop branch also uses, but falls back to the AI heuristics if they fail. Hence the peak around 1 for these cases; apparently the hard-coded heuristics do still fail sometimes, leading to a slight improvement even for these cases.
[figure: updated MI308 kernel-time ratio histograms (develop vs. experimental), with average numbers in the legends]

amd-bartgips merged commit 422e872 into develop Sep 22, 2025
8 checks passed
amd-bartgips deleted the silo/feat/3d_conv_heuristics branch September 22, 2025 13:29
assistant-librarian bot pushed a commit to ROCm/MIOpen that referenced this pull request Sep 22, 2025:
[MIOpen] Implement kernel tuning heuristic model for 3D conv ops (two tower model) (#1154)

@rocm-devops

Github action triggered OSDB jenkins job: http://rocm-ci.amd.com/job/compute-rocm-dkms-mathlibs-osdb/286

JonathanLichtnerAMD added a commit that referenced this pull request Sep 23, 2025:

Revert "[MIOpen] Implement kernel tuning heuristic model for 3D conv ops (two tower model) (#1154)" (#1740)

This reverts commit 422e872.

## Motivation

That commit broke MI300 unit tests.

## Test Plan

Reverted this change and verified that the failing build now passes

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
