
initial gfx1036 and gfx1103 support #111

Merged: 2 commits merged into master from wip/rocm_sdk_builder_612_bg103 on Jul 17, 2024

Conversation

lamikr
Owner

@lamikr lamikr commented Jul 14, 2024

  • initial support for gfx1036 and gfx1103 as build targets
  • also updated the gfx1010 configuration settings to be more consistent between composable_kernel and MIOpen

fixes: #101
fixes: #103

@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch 4 times, most recently from 1e191b4 to 3fc282c Compare July 15, 2024 07:09
@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch from 1453618 to 3051c3e Compare July 15, 2024 08:09
@jeroen-mostert
Contributor

A fresh build of this is currently broken: ROCm-CompilerSupport complains about a missing hip/hip_runtime.h. I don't know the cause or the fix.

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 15, 2024

Hmm, something seems to be going wrong when initializing repos. A ./babs.sh -f followed by ./babs.sh -co now complains about rocRAND. This might be the cause of the earlier failures too. Time to nuke everything from orbit and re-download the world; we'll see how that goes.

Edit: ah, it's probably this: 27a9480. It's a shame there's no easy way to see which changes require reinitializing the source.

Complete reinit did not fix the hip/hip_runtime.h include error, though. If something has changed upstream, I don't see it. I was able to work around it by adding -DBUILD_TESTING=false to the binfo to skip building the tests, but I'm unclear on why this didn't break earlier.
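
For reference, BUILD_TESTING is just the standard CMake/CTest switch; configuring by hand it would look like this (the source path is a placeholder, and the binfo wiring in this repo may spell it differently):

cmake -DBUILD_TESTING=false /path/to/ROCm-CompilerSupport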

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 15, 2024

Building rocBLAS's Tensile fails for gfx1036, the same issue I mentioned in #103 (comment). At least this confirms my own initial attempt at adding support wasn't any worse. :P

The fix is adding (10, 3, 6) to globalParameters["SupportedISA"]. This was present for tensilelite in the patch, but not for Tensile itself. Yay for multiple copies of stuff. It's also bizarre that Tensile's build process is apparently OK with trying to generate code for an ISA that is globally marked as unsupported.
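
For anyone hitting the same thing, a quick way to inspect a Tensile checkout (paths from memory, so double-check against your tree):

# the ISA whitelist lives in Tensile's Common.py
grep -n 'SupportedISA' Tensile/Common.py
# the fix is making sure (10, 3, 6) appears in that list, roughly:
#   globalParameters["SupportedISA"] = [..., (10,3,0), (10,3,1), (10,3,6), ...]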

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 15, 2024

I'll just leave this potentially useful oneliner here:

comm -3 <(rg -lF gfx1035 | awk -F/ '{print $1}' | sort -u) <(rg -lF gfx1036 | awk -F/ '{print $1}' | sort -u)

Run it in src_projects; it lists all projects where gfx1035 is mentioned but not gfx1036, which likely need extra attention. Of course we can do something similar for gfx1030 vs. gfx103?, to see if the latter needs extra support. False positives and negatives appear to be minimal.
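
That gfx1030-vs-gfx103x variant could look like this (an untested sketch; the character class is my own):

comm -3 <(rg -lF gfx1030 | awk -F/ '{print $1}' | sort -u) <(rg -l 'gfx103[0-9]' | awk -F/ '{print $1}' | sort -u)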

An even neater idea would be to write some kind of auto-patcher that, given the gfx, adds the necessary changes, since it's basically the same things over and over, save for the core definitions of HW capabilities that we do need to supply. Then again, this may be one of those things that ultimately takes more time to develop than it actually saves.

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 15, 2024

Interesting stuff so far. Building for the gfx1036 is a disastrous performance regression compared to using HSA_OVERRIDE_GFX_VERSION="10.3.0" -- we're talking up to 6.5x slower for a prompt processing task. Clearly something is going very wrong here (likely to do with missing specialized kernels or not selecting them somehow). The GPU is utilized 100% in both cases. I will have to dig deeper to find out where this is coming from.

As an aside, HSA_OVERRIDE_GFX_VERSION is an extremely effective foot gun. Setting it to a value that is incompatible with the current GPU will cause an access violation if you're lucky, and anything from output corruption to a partially disabled GPU to a complete system crash if you're not. Do not include it as a matter of course, as I did in repeated tests. :P
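
The easy mitigation is to scope the override to a single command instead of exporting it shell-wide, e.g. (./your-app is a placeholder):

HSA_OVERRIDE_GFX_VERSION=10.3.0 ./your-app args...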

@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch from 3051c3e to b14c488 Compare July 15, 2024 16:59
@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch from b14c488 to d6129bf Compare July 15, 2024 17:31
@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch from d6129bf to a09b964 Compare July 15, 2024 18:16
- initial support for gfx1036 and gfx1103 as a build target
- updated also the gfx1010 configuration settings to be
  more similar in composable kernel and miopen

fixes: #101
fixes: #103

Signed-off-by: Mika Laitio <[email protected]>
@lamikr lamikr force-pushed the wip/rocm_sdk_builder_612_bg103 branch from 1f1974c to 989fdf5 Compare July 16, 2024 05:39
@lamikr
Owner Author

lamikr commented Jul 16, 2024

Sorry, responding a little late.

  1. Sorry about the rocRAND issue; I had to push the fix a couple of times to sort out the differences. Earlier rocRAND also built hipRAND; now they need to be built separately. In addition there were some errors in the .gitmodules file itself, which is why I had forked it to my own repo. Those are now fixed, so I could jump back to the upstream one.
  2. Do you still have the ROCm-CompilerSupport build problem? I wonder whether a change in the LLVM compiler's default target triple caused that in your environment.
  3. I just pushed Tensile's SupportedISA fix for 1036 in the latest revision of this patch. My earlier patch added 1103 there, but I had forgotten to add 1036 as well.
  4. Can you share the test you are running with and without the HSA override? It would help me debug the performance issue.
    One explanation for the slower performance could be that rocBLAS, for example, has kernel logic files for gfx1030 but not for the other navi2x devices. I would like to figure out how to generate those logic YAML files from the kernels for the GPUs still missing them, but I have not figured out the exact steps yet. There is a small thread started about it on Discussions.
  5. I have now also updated magma to the latest code revision in the repo; it had some ROCm-specific fixes.
  6. My build just finished; I had gfx1102 and gfx1103 as targets on Fedora 40. I will trigger another fresh build against the latest code version of this PR just in case and let it build overnight.

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 16, 2024

I will test all the changes with a fresh build to see if any problems remain.

My perf test is fairly simple so it should be easy to repro. I'm using llama.cpp because it's particularly easy to compile standalone; all you need is hipBLAS and make GGML_HIPBLAS=1 (you have to edit the Makefile to change /opt/rocm to wherever ROCm actually lives; I just used ROCM_PATH = ${ROCM_HOME} -- ensure it can run amdgpu-arch correctly or it won't compile specialized kernels). Then acquire any model in GGUF form (for example a small one like Phi 3) and run

CUDA_VISIBLE_DEVICES=... ./llama-cli -m Phi-3-mini-4k-instruct-q4.gguf -ngl 0 -n 10 -f <(printf 'banana %.0s' {1..50})

The -ngl 0 offloads no layers to the GPU, so we're only testing prompt processing speed. Use CUDA_VISIBLE_DEVICES to ensure it won't try a multi-GPU setup. With a dedicated build, I get a speed of 9 tokens/sec. Using HSA_OVERRIDE_GFX_VERSION="10.3.0", it's up to 59 tokens/sec.
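
For completeness, a minimal sketch of the build itself, assuming ROCM_HOME points at the SDK install (passing ROCM_PATH on the make command line should override the Makefile default, so hand-editing may not even be needed):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_HIPBLAS=1 ROCM_PATH="${ROCM_HOME}" -j"$(nproc)"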

There are probably even simpler tests that don't even require a model, by poking BLAS directly with matrix multiplications (llama.cpp actually includes a matrix multiplication benchmark, but it only tests CPU perf).
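
For instance, rocBLAS's own benchmark client can drive a single GEMM with no model involved; a hypothetical invocation (flags from memory, check rocblas-bench --help):

./rocblas-bench -f gemm -r f32_r -m 8192 -n 8192 -k 8192 --transposeA N --transposeB N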

@lamikr
Owner Author

lamikr commented Jul 16, 2024

Thanks for confirming; I'll try to run llama.cpp. I did some research yesterday on existing rocBLAS tests and found these two, in addition to the tests in rocBLAS itself.
https://github.com/TorreZuk/rocBLAS-Examples-1/tree/develop
https://github.com/LeiWang1999/rocblas-benchmark

I launched two clean builds last night, and the one for gfx1102 and gfx1103 has just finished on Fedora 40.
Another build, on Mageia, with gfx1030, gfx1035 and gfx1036 as targets, is still building, but I believe it will also pass.
And the tests build OK in the builddir/009_02_rocm-compilersupport/test directory.

@lamikr
Owner Author

lamikr commented Jul 16, 2024

I may have been able to reproduce and fix the rocBLAS slowness you are seeing without the HSA override on my gfx1103; I did the following steps to test it.

  1. Prepare the benchmark:
# git clone https://github.com/LeiWang1999/rocblas-benchmark
# cd rocblas-benchmark
- to speed up testing, comment out all tests in rocblas_benchmark.cpp except std::make_tuple(8192, 8192, 8192, false, false, enable_tune)
# mkdir build
# cd build
# cmake ..
# make


  2. Disable gfx1102 so that the tests run on gfx1103:
# amd-smi list

GPU: 0
    BDF: 0000:03:00.0
    UUID: 00ff7480-0000-1000-8000-000000000000

GPU: 1
    BDF: 0000:c4:00.0
    UUID: 00ff15bf-0000-1000-8000-000000000000

# export ROCR_VISIBLE_DEVICES="1"
  3. Run the test without the HSA override on gfx1103 (times 916.684, 820.721, 823.48, 1018.46):
# ./rocblas_benchmark
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,916.684,820.721,823.48,1018.46

  4. Run the same test with the HSA override on gfx1103 (about 5x improvement on times for some tests):
# HSA_OVERRIDE_GFX_VERSION=11.0.2 ./rocblas_benchmark 
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,2105.05,984.701,202.698,208.54

  5. After adding the logic files to rocBLAS for the Phoenix GPU (gfx1103) and rebuilding rocBLAS, the results are now close to the earlier results with HSA_OVERRIDE:
# ./rocblas_benchmark 
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,2121.01,960.537,203.202,208.946

@jeroen-mostert
Contributor

jeroen-mostert commented Jul 16, 2024

That's promising. I wonder if any specific gains could be made by specializing things for the cache size/batch size/CU occupancy appropriate for the iGPU rather than the general arch (mind you I haven't looked into these files so I might be talking out of my behind here).

In other news, my build is finally done -- if I never have to see aotriton compile again, I'll be a happy man -- and I can confirm it works from scratch now without further tweaking (gfx1030, gfx1036, gfx1102 targets, for completeness).

Side note in case you didn't already know: Linux kernel 6.10 (which went final a few days ago) has tweaks in amdkfd specifically to improve the memory allocation story for iGPUs. Instead of using the locked memory, it will go straight to GTT, making it much easier to process large data without changing software or reserving large amounts of memory through BIOS settings. I have been running 6.10 since before I knew this was a thing, which explains why nvtop never showed any memory allocated to the iGPU (it doesn't show GTT allocations, amdgpu_top --smi does, even if it doesn't look as snazzy).

@lamikr
Owner Author

lamikr commented Jul 16, 2024

My traditional top dislike has been the rocFFT database generation phase, but I agree that the aotriton build is also bad :-)

I would like to have a benchmark where we could run the same operations on the CPU, GPU and iGPU.
Of course, for big matrix operations, vector computers and GPUs that can run the same instruction on multiple CUs at the same time are unbeatable, but it could well be that there are some tasks where the CPU is better anyway. That's not easy to detect without benchmarking and profiling. Once that data is available, Amdahl's law provides a quite good guide for where it makes the most sense to put effort. (Intel researchers published a pretty good paper about 10-15 years ago on the pros and cons of CPUs vs. GPUs on different tasks on the hardware of that time. The same principles still hold.)
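
For reference, Amdahl's law in its usual form, where p is the fraction of the runtime that the optimization applies to and s is the speedup of that fraction:

S_{\mathrm{total}} = \frac{1}{(1 - p) + p / s}

For example, a 5x GEMM speedup yields only about a 3.6x overall speedup if GEMM is 90% of the runtime.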

It would then also be nice to run good benchmarks both with the older 6.8/6.9 kernels and with 6.10; those changes look very interesting, and I want to build and boot with it to test once the APU acceleration is in place.

- add initial rocBLAS logic files for
  rembrandt (gfx1035), raphael (gfx1036)
  and phoenix (gfx1103) iGPUs.
- when testing with
  https://github.com/LeiWang1999/rocblas-benchmark
  using std::make_tuple(8192, 8192, 8192, false, false, enable_tune),
  the speedup was about 4-5x.
- gfx1035 without logic files

Device 0: AMD Radeon Graphics
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,912.287,814.502,854.257,865.103

- gfx1035 with logic files

Device 0: AMD Radeon Graphics
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,652.499,834.796,237.42,189.945

- gfx1103 without logic files
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,916.684,820.721,823.48,1018.46

- gfx1103 with logic files
ROCR_VISIBLE_DEVICES="1" ./rocblas_benchmark
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,1346.02,634.836,193.613,119.29

Signed-off-by: Mika Laitio <[email protected]>
@lamikr
Owner Author

lamikr commented Jul 17, 2024

I will merge this now; let's keep the discussion going.

I moved our discussion to #114

@lamikr lamikr merged commit ee4b0fa into master Jul 17, 2024
@lamikr lamikr deleted the wip/rocm_sdk_builder_612_bg103 branch July 17, 2024 03:52

Successfully merging this pull request may close these issues.

  • framework laptop 16 hybrid gpu support
  • gfx1036 support?