
feat(ml): rocm #16613

Open · mertalev wants to merge 29 commits into main
Conversation

mertalev
Contributor

@mertalev mertalev commented Mar 5, 2025

Description

This PR introduces support for AMD GPUs through ROCm. It's a rebased version of #11063 with updated dependencies.

It also once again removes algo caching, as the concurrency issue with caching seems to be more subtle than originally thought. While disabling caching is wasteful (it essentially runs a benchmark every time instead of only once), it's still better than the current alternative of either lowering concurrency to 1 or not having ROCm support.

@mertalev mertalev requested a review from bo0tzz as a code owner March 5, 2025 14:46
@github-actions github-actions bot added documentation Improvements or additions to documentation 🧠machine-learning labels Mar 5, 2025
@mertalev mertalev added changelog:feature and removed documentation Improvements or additions to documentation labels Mar 5, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 5, 2025
Contributor

github-actions bot commented Mar 5, 2025

📖 Documentation deployed to pr-16613.preview.immich.app

steps:
- name: Login to GitHub Container Registry
Member

@NicholasFlamy NicholasFlamy Mar 5, 2025


There are some changes in indentation, as well as changes from double quotes to single quotes. Was this intended? I know it's from the first commit of the original PR, but I don't think it was ever addressed.

Contributor Author


VS Code did this when I saved. I'm not sure why it's different.

Member

@NicholasFlamy NicholasFlamy Mar 5, 2025


Is there a PR check that runs prettier on the workflow files? I would think the inconsistency exists because there likely isn't.

Contributor

@zackpollard zackpollard left a comment


Nice! The Docker cache appears to be working with no changes. Would you mind changing something within ML itself that requires a source code change and rebuild, just so we can see the cache working in those cases before we merge?

@satmandu

satmandu commented Mar 7, 2025

FYI, there's a set of rocm builds available supporting a wider range of AMD hardware, which might be useful:

lamikr/rocm_sdk_builder#216

@NicholasFlamy
Member

FYI, there's a set of rocm builds available supporting a wider range of AMD hardware, which might be useful:

lamikr/rocm_sdk_builder#216

"ROCM SDK Builder 6.1.2 is based on to ROCM 6.1.2"
That's a little older but probably okay. I'm not sure what the point of using it is, though. From what I can tell, it doesn't support a wider range of hardware; it's the same support ROCm normally has.

@SharkWipf

Sadly, no, not quite. Official ROCm does not support, for instance, gfx1103 (the RX 780M and similar iGPUs, 7940HS and similar APUs).
That said, I don't know if Immich can make use of it, since applications using ROCm need to be built against it, I believe; i.e., prebuilt PyTorch builds won't work.
I'm not sure what Immich uses, but I'm chiming in because I would love to run Immich on the iGPUs in question, and they are common in current-gen mini PCs.

@NicholasFlamy
Member

NicholasFlamy commented Mar 7, 2025

Sadly, no, not quite. Official ROCm does not support, for instance, gfx1103 (RX 780M and similar iGPUs, 7940HS and similar APUs). That said, I don't know if Immich can make use of it, since applications using ROCm need to be built against it I believe. I.e. prebuilt Pytorch builds won't work. I'm not sure what Immich uses, but I'm chiming in because I would love to run Immich on those iGPUs in question, and they are common in current gen mini PCs.

The officially listed support in the docs is mostly just gfx103X and gfx110X, plus a few others. They're inconsistent: "supported" means their team will help you on GitHub with certain things, while anything not on the list may still work (e.g. Vega GPUs work fine) but without their help.

Edit: So my question would be: how does one check what's supported by the build they are running?
Also, we are building onnxruntime from source, so if you want broader support, let us know what command-line flags are needed.

@SharkWipf

SharkWipf commented Mar 7, 2025

They're inconsistent and define supported as our team will help you on GitHub with certain stuff but anything not on the list may work (eg. Vega GPUs work fine) but they won't help you.

Yeah, but the official ROCm build will not work with gfx1103 at all; applications built against it (i.e. prebuilt PyTorch) will not work with gfx1103, and building against it for gfx1103 will not work either.
I'm not sure what the exact steps are to get gfx1103 into ROCm, but I do know it requires a custom build/version of ROCm. And while, as you said, AMD's stance is "it may work but we won't help you out", that does not mean it will work without this custom ROCm build.

Edit: So my question would be, how does one check what's supported by the build they are running?

I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:

$ ls /usr/lib64/rocm/
gfx10  gfx11  gfx1100  gfx1103  gfx8  gfx9  gfx90a  gfx942

@satmandu

satmandu commented Mar 7, 2025

Maybe it would be useful to have two ROCm-flavored options? One with the current mainline ROCm version, and one with the community version built to support a wider variety of GPUs?

@NicholasFlamy
Member

provided as a separate package and listed as a separate folder

Nice, they split them up by GPU target. Eventually we want to do that to cut down the 30 GB image size; Frigate also splits them up. The current image we build has multiple targets all baked into one image.

@SharkWipf

SharkWipf commented Mar 7, 2025

Doing that would also resolve the issue of "official or unofficial build?" I suppose, since you can just provide the official builds for the supported GPUs and the unofficial builds for the non-supported GPUs. But you'd need to provide a lot of images that way.

Edit: FYI:

$ du -hs /usr/lib64/rocm/* | sort -h
0       /usr/lib64/rocm/gfx8
452M    /usr/lib64/rocm/gfx1100
467M    /usr/lib64/rocm/gfx942
1.2G    /usr/lib64/rocm/gfx1103
2.0G    /usr/lib64/rocm/gfx90a
2.3G    /usr/lib64/rocm/gfx10
2.3G    /usr/lib64/rocm/gfx11
5.5G    /usr/lib64/rocm/gfx9


WORKDIR /code

RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
Member

@NicholasFlamy NicholasFlamy Mar 7, 2025


Suggested change
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx-dev

Only migraphx-dev is needed, as the other two are pulled in as dependencies.

Edit: don't change it now, though, because it's already building.

@@ -80,11 +111,14 @@ COPY --from=builder-armnn \
/opt/ann/build.sh \
/opt/armnn/

FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS prod-rocm
Member


I know there were already comments on this, but I think copying the deps manually may result in a smaller, yet still working image. It might be worth re-investigating.
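To make the idea concrete, here is a minimal sketch of the manual-copy approach. The stage names, paths, and the choice of what to copy are assumptions for illustration, not a verified recipe; the actual set of libraries onnxruntime needs at runtime would have to be determined empirically (e.g. with ldd against the built wheel).

```dockerfile
# Hypothetical multi-stage copy: take only the ROCm runtime libraries
# out of the large -complete dev image instead of basing the final
# image on it.
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS rocm-src

FROM ubuntu:22.04 AS prod-rocm
# Assumption: /opt/rocm/lib holds the shared libraries the ONNX Runtime
# ROCm execution provider loads at runtime. Copying the whole directory
# is still coarse; trimming to the exact .so files would shrink further.
COPY --from=rocm-src /opt/rocm/lib /opt/rocm/lib
ENV LD_LIBRARY_PATH=/opt/rocm/lib
```

Whether this actually works depends on how many pieces of the dev image (compiler runtimes, device code objects) the execution provider touches at runtime, which is why it needs re-investigating rather than assuming.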

@@ -15,6 +15,34 @@ RUN mkdir /opt/armnn && \
cd /opt/ann && \
sh build.sh

# Warning: 25GiB+ disk space required to pull this image
# TODO: find a way to reduce the image size
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS builder-rocm

This comment was marked as resolved.

Member


Nope. Not it.

@przemekbialek

They're inconsistent and define supported as our team will help you on GitHub with certain stuff but anything not on the list may work (eg. Vega GPUs work fine) but they won't help you.

Yeah, but the official ROCm build will not work with gfx1103 at all, applications built against it (i.e. pytorch prebuilt) will not work with gfx1103, and building against it for gfx1103 will not work either. I'm not sure what the exact steps are to get gfx1103 in ROCm but I do know it requires a custom build/version of ROCm. And while as you said, AMD's stance is "it may work but we won't help you out", it does not mean it will work without this custom ROCm build.

Edit: So my question would be, how does one check what's supported by the build they are running?

I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:

$ ls /usr/lib64/rocm/
gfx10  gfx11  gfx1100  gfx1103  gfx8  gfx9  gfx90a  gfx942

The Fedora rocBLAS patch for gfx1103 support looks like a copy of gfx1102 (navi33); only the names and ISA versions differ. I diffed a few files and think these are the only differences.

-- phoenix
-- gfx1103
-- [Device 1586]
+- navi33
+- gfx1102
+- [Device 73f0]
 - AllowNoFreeDims: false
   AssignedDerivedParameters: true
   Batched: true
@@ -112,7 +112,7 @@
     GroupLoadStore: false
     GuaranteeNoPartialA: false
     GuaranteeNoPartialB: false
-    ISA: [11, 0, 3]
+    ISA: [11, 0, 2]

I'm interested in additional GPU support because I have a mini PC with a Ryzen 8845HS (Radeon 780M) for testing, and a second one with a Ryzen 5825U.
I tried running the ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm version with HSA_OVERRIDE_GFX_VERSION=11.0.0, but this setup crashes my card under heavy load (only the default Immich models work, and only when I run one type of job in a single thread). I read that for the 780M the best choice is gfx1102, but when I set HSA_OVERRIDE_GFX_VERSION=11.0.2 I get errors. I think it's because onnxruntime isn't compiled with support for this arch. Now I'm trying to build machine-learning with ROCm onnxruntime support using a small patch which I think enables gfx900 and gfx1102 support in onnxruntime; if and when the build completes, I will try this.

diff --git a/cmake/CMakeLists.txt b/cmake/CMakeLists.txt
index d90a2a355..bb1a7de12 100644
--- a/cmake/CMakeLists.txt
+++ b/cmake/CMakeLists.txt
@@ -295,7 +295,7 @@ if (onnxruntime_USE_ROCM)
   endif()

   if (NOT CMAKE_HIP_ARCHITECTURES)
-    set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942;gfx1200;gfx1201")
+    set(CMAKE_HIP_ARCHITECTURES "gfx900;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102;gfx940;gfx941;gfx942;gfx1200;gfx1201")
   endif()

   file(GLOB rocm_cmake_components ${onnxruntime_ROCM_HOME}/lib/cmake/*)

@SharkWipf

but this setup crashes my card under heavy load

My 780M locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any gfx version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich, that does not bode well...

@NicholasFlamy
Member

HSA_OVERRIDE_GFX_VERSION=11.0.2

This is not a valid version from what I've observed. So far, there are only 3 valid options:

HSA_OVERRIDE_GFX_VERSION=11.0.0
HSA_OVERRIDE_GFX_VERSION=10.3.0
HSA_OVERRIDE_GFX_VERSION=9.0.0
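For anyone following along, a minimal sketch of how one of these overrides might be wired into a compose file. This is untested; the image tag is the PR preview build mentioned elsewhere in this thread, and the /dev/kfd and /dev/dri device mappings are the usual ones for giving a container ROCm access.

```yaml
services:
  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm
    devices:
      - /dev/kfd   # ROCm compute interface
      - /dev/dri   # GPU render nodes
    environment:
      # Assumption: spoof the gfx target closest to your GPU's real
      # architecture; only the values listed above appear to be valid.
      HSA_OVERRIDE_GFX_VERSION: "11.0.0"
```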

@przemekbialek

but this setup crashes my card under heavy load

My 780m locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich that does not bode well...

Unfortunately, adding support for gfx1102 doesn't solve the crashing problems on the Radeon 780M, but I'm happy because I succeeded in getting it to work on the Ryzen 5825U GPU.

@NicholasFlamy
Member

Radeon 780M

They also specifically say certain iGPUs crash. I would bet they're just bleeding edge.

Ryzen 5825U GPU

That model or similar is known to work.

@mertalev
Contributor Author

I'm removing MIGraphX for now and moving back to direct ROCm. There are some advantages to using MIGraphX, so we might circle back to it down the line. I've also updated the PR based on some of the later comments on GPU compatibility.

Labels: changelog:feature · documentation · 🧠machine-learning

9 participants