
feat(ml): rocm #16613

Open · mertalev wants to merge 29 commits into main
Conversation

mertalev
Contributor

@mertalev mertalev commented Mar 5, 2025

Description

This PR introduces support for AMD GPUs through ROCm. It's a rebased version of #11063 with updated dependencies.

It also once again removes algo caching, as the concurrency issue with caching seems to be more subtle than originally thought. While disabling caching is wasteful (it essentially runs a benchmark every time instead of only once), it's still better than the current alternative of either lowering concurrency to 1 or not having ROCm support.

@mertalev mertalev requested a review from bo0tzz as a code owner March 5, 2025 14:46
@github-actions github-actions bot added documentation Improvements or additions to documentation 🧠machine-learning labels Mar 5, 2025
@mertalev mertalev added changelog:feature and removed documentation Improvements or additions to documentation labels Mar 5, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 5, 2025
Contributor

github-actions bot commented Mar 5, 2025

📖 Documentation deployed to pr-16613.preview.immich.app

steps:
- name: Login to GitHub Container Registry
Member

@NicholasFlamy NicholasFlamy Mar 5, 2025


There are some changes in indentation, as well as changes from double quotes to single quotes. Was this intended? I know it's from the first commit of the original PR, but I don't think it was ever addressed.

Contributor Author


VS Code did this when I saved. I'm not sure why it's different.

Member

@NicholasFlamy NicholasFlamy Mar 5, 2025


Is there a PR check that runs prettier on the workflow files? I would think the inconsistency exists because there likely isn't.

Contributor

@zackpollard zackpollard left a comment


Nice! The Docker cache appears to be working with no changes. Would you mind changing something within ML itself that requires a source code change and rebuild, just so we can see the cache working in those cases before we merge?

@satmandu

satmandu commented Mar 7, 2025

FYI, there's a set of rocm builds available supporting a wider range of AMD hardware, which might be useful:

lamikr/rocm_sdk_builder#216

@NicholasFlamy
Member

FYI, there's a set of rocm builds available supporting a wider range of AMD hardware, which might be useful:

lamikr/rocm_sdk_builder#216

"ROCM SDK Builder 6.1.2 is based on to ROCM 6.1.2"
That's a little older but probably okay. I'm not sure what the point of using it is, though. From what I can tell, it doesn't support a wider range of hardware; it's the same support ROCm normally has.

@SharkWipf

Sadly, no, not quite. Official ROCm does not support, for instance, gfx1103 (the RX 780M and similar iGPUs, 7940HS and similar APUs).
That said, I don't know if Immich can make use of it, since applications using ROCm need to be built against it, I believe; i.e., prebuilt PyTorch builds won't work.
I'm not sure what Immich uses, but I'm chiming in because I would love to run Immich on the iGPUs in question, and they are common in current-gen mini PCs.

@NicholasFlamy
Member

NicholasFlamy commented Mar 7, 2025

Sadly, no, not quite. Official ROCm does not support, for instance, gfx1103 (RX 780M and similar iGPUs, 7940HS and similar APUs). That said, I don't know if Immich can make use of it, since applications using ROCm need to be built against it I believe. I.e. prebuilt Pytorch builds won't work. I'm not sure what Immich uses, but I'm chiming in because I would love to run Immich on those iGPUs in question, and they are common in current gen mini PCs.

The officially listed support in the docs is mostly just gfx103X and gfx110X, plus a few others. They're inconsistent: "supported" means their team will help you on GitHub with certain things, while anything not on the list may still work (e.g. Vega GPUs work fine) but without their help.

Edit: So my question would be: how does one check what's supported by the build they are running?
Also, we are building onnxruntime from source, so if you want broader support, let us know what command-line flags are needed.

@SharkWipf

SharkWipf commented Mar 7, 2025

They're inconsistent and define supported as our team will help you on GitHub with certain stuff but anything not on the list may work (eg. Vega GPUs work fine) but they won't help you.

Yeah, but the official ROCm build will not work with gfx1103 at all; applications built against it (i.e. prebuilt PyTorch) will not work with gfx1103, and building against it for gfx1103 will not work either.
I'm not sure what the exact steps are to get gfx1103 into ROCm, but I do know it requires a custom build/version of ROCm. And while, as you said, AMD's stance is "it may work but we won't help you out", that does not mean it will work without this custom ROCm build.

Edit: So my question would be, how does one check what's supported by the build they are running?

I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:

$ ls /usr/lib64/rocm/
gfx10  gfx11  gfx1100  gfx1103  gfx8  gfx9  gfx90a  gfx942

@satmandu

satmandu commented Mar 7, 2025

Maybe it would be useful to have two ROCm-flavored options? One with the current mainline ROCm version, and one with the community version built to support a wider variety of GPUs?

@NicholasFlamy
Member

provided as a separate package and listed as a separate folder

Nice, they split them up by GPU target. Eventually we want to do that to cut down the 30 GB image size; Frigate also splits them up. The current image we build has multiple targets all baked into one image.

@SharkWipf

SharkWipf commented Mar 7, 2025

Doing that would also resolve the issue of "official or unofficial build?" I suppose, since you can just provide the official builds for the supported GPUs and the unofficial builds for the non-supported GPUs. But you'd need to provide a lot of images that way.

Edit: FYI:

$ du -hs /usr/lib64/rocm/* | sort -h
0       /usr/lib64/rocm/gfx8
452M    /usr/lib64/rocm/gfx1100
467M    /usr/lib64/rocm/gfx942
1.2G    /usr/lib64/rocm/gfx1103
2.0G    /usr/lib64/rocm/gfx90a
2.3G    /usr/lib64/rocm/gfx10
2.3G    /usr/lib64/rocm/gfx11
5.5G    /usr/lib64/rocm/gfx9


WORKDIR /code

RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
Member

@NicholasFlamy NicholasFlamy Mar 7, 2025


Suggested change
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx migraphx-dev half
RUN apt-get update && apt-get install -y --no-install-recommends wget git python3.10-venv migraphx-dev

Only migraphx-dev is needed, as the other two are pulled in as dependencies.

Edit: don't change it now, though, because it's already building.

@@ -80,11 +111,14 @@ COPY --from=builder-armnn \
/opt/ann/build.sh \
/opt/armnn/

FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS prod-rocm
Member


I know there were already comments on this, but I think copying the deps manually may result in a smaller, yet still working image. It might be worth re-investigating.
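To make the idea concrete, here is a minimal sketch of the manual-copy approach. The stage names, paths, and the choice of what to copy are assumptions for illustration, not a verified recipe; the actual set of libraries onnxruntime needs at runtime would have to be determined empirically (e.g. with ldd against the built wheel).

```dockerfile
# Hypothetical multi-stage copy: take only the ROCm runtime libraries
# out of the large -complete dev image instead of basing the final
# image on it.
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS rocm-src

FROM ubuntu:22.04 AS prod-rocm
# Assumption: /opt/rocm/lib holds the shared libraries the ONNX Runtime
# ROCm execution provider loads at runtime. Copying the whole directory
# is still coarse; trimming to the exact .so files would shrink further.
COPY --from=rocm-src /opt/rocm/lib /opt/rocm/lib
ENV LD_LIBRARY_PATH=/opt/rocm/lib
```

Whether this actually works depends on how many pieces of the dev image (compiler runtimes, device code objects) the execution provider touches at runtime, which is why it needs re-investigating rather than assuming.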

@@ -15,6 +15,34 @@ RUN mkdir /opt/armnn && \
cd /opt/ann && \
sh build.sh

# Warning: 25GiB+ disk space required to pull this image
# TODO: find a way to reduce the image size
FROM rocm/dev-ubuntu-22.04:6.3.4-complete AS builder-rocm

This comment was marked as resolved.

Member


Nope. Not it.

@przemekbialek

They're inconsistent and define supported as our team will help you on GitHub with certain stuff but anything not on the list may work (eg. Vega GPUs work fine) but they won't help you.

Yeah, but the official ROCm build will not work with gfx1103 at all, applications built against it (i.e. pytorch prebuilt) will not work with gfx1103, and building against it for gfx1103 will not work either. I'm not sure what the exact steps are to get gfx1103 in ROCm but I do know it requires a custom build/version of ROCm. And while as you said, AMD's stance is "it may work but we won't help you out", it does not mean it will work without this custom ROCm build.

Edit: So my question would be, how does one check what's supported by the build they are running?

I'm not quite sure. On Fedora, the gfx1103 build is provided as a separate package and listed as a separate folder, but the officially supported gfx1102 falls under gfx1100 here, so it's not a reliable check:

$ ls /usr/lib64/rocm/
gfx10  gfx11  gfx1100  gfx1103  gfx8  gfx9  gfx90a  gfx942

The Fedora rocBLAS patch for gfx1103 support looks like a copy of gfx1102 (navi33); only the names and ISA versions differ. I diffed a few files and think these are the only differences.

-- phoenix
-- gfx1103
-- [Device 1586]
+- navi33
+- gfx1102
+- [Device 73f0]
 - AllowNoFreeDims: false
   AssignedDerivedParameters: true
   Batched: true
@@ -112,7 +112,7 @@
     GroupLoadStore: false
     GuaranteeNoPartialA: false
     GuaranteeNoPartialB: false
-    ISA: [11, 0, 3]
+    ISA: [11, 0, 2]

I'm interested in additional GPU support because I have a mini PC with a Ryzen 8845HS (Radeon 780M) for testing, and a second one with a Ryzen 5825U.
I tried running the ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm version with HSA_OVERRIDE_GFX_VERSION=11.0.0, but this setup crashes my card under heavy load (only the default Immich models work, and only when I run one type of job in a single thread). I read that for the 780M the best choice is gfx1102, but when I set HSA_OVERRIDE_GFX_VERSION=11.0.2 I get errors. I think it's because onnxruntime isn't compiled with support for this arch. Now I'm trying to build machine-learning with ROCm onnxruntime support using a small patch which I think enables gfx900 and gfx1102 support in onnxruntime; if and when the build completes, I will try this.

diff --git a/cmake/CMakeLists.txt b/cmake/CMakeLists.txt
index d90a2a355..bb1a7de12 100644
--- a/cmake/CMakeLists.txt
+++ b/cmake/CMakeLists.txt
@@ -295,7 +295,7 @@ if (onnxruntime_USE_ROCM)
   endif()

   if (NOT CMAKE_HIP_ARCHITECTURES)
-    set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx940;gfx941;gfx942;gfx1200;gfx1201")
+    set(CMAKE_HIP_ARCHITECTURES "gfx900;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx1102;gfx940;gfx941;gfx942;gfx1200;gfx1201")
   endif()

   file(GLOB rocm_cmake_components ${onnxruntime_ROCM_HOME}/lib/cmake/*)

@SharkWipf

but this setup crashes my card under heavy load

My 780M locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any gfx version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich, that does not bode well...

@NicholasFlamy
Member

HSA_OVERRIDE_GFX_VERSION=11.0.2

This is not a valid version from what I've observed. So far, there are only 3 valid options:

HSA_OVERRIDE_GFX_VERSION=11.0.0
HSA_OVERRIDE_GFX_VERSION=10.3.0
HSA_OVERRIDE_GFX_VERSION=9.0.0
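For anyone following along, a minimal sketch of how one of these overrides might be wired into a compose file. This is untested; the image tag is the PR preview build mentioned elsewhere in this thread, and the /dev/kfd and /dev/dri device mappings are the usual ones for giving a container ROCm access.

```yaml
services:
  immich-machine-learning:
    image: ghcr.io/immich-app/immich-machine-learning:pr-16613-rocm
    devices:
      - /dev/kfd   # ROCm compute interface
      - /dev/dri   # GPU render nodes
    environment:
      # Assumption: spoof the gfx target closest to your GPU's real
      # architecture; only the values listed above appear to be valid.
      HSA_OVERRIDE_GFX_VERSION: "11.0.0"
```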

@przemekbialek

but this setup crashes my card under heavy load

My 780m locks up my desktop roughly 50% of the time when using ROCm llama.cpp/whisper.cpp with any ROCm version (1100, 1102, 1103). I'd hoped it would be less of an issue headless or with different applications, but if you have the same issue with Immich that does not bode well...

Unfortunately, adding support for gfx1102 doesn't solve the crashing problems on the Radeon 780M, but I'm happy because I succeeded in getting it to work on the Ryzen 5825U GPU.

@NicholasFlamy
Member

Radeon 780M

They also specifically say certain iGPUs crash. I would bet they're just bleeding edge.

Ryzen 5825U GPU

That model or similar is known to work.

@mertalev
Contributor Author

I'm removing MIGraphX for now and moving back to direct ROCm. There are some advantages to using MIGraphX, so we might circle back to it down the line. I've also updated the PR based on some of the later comments on GPU compatibility.

Labels: changelog:feature · documentation · 🧠machine-learning

9 participants