Refactor/online repacking #10446
Conversation
ggml/src/ggml-cpu/ggml-cpu-hbm.cpp
Outdated
Not sure it is good; I cannot test it.
And it may not work/build on the master branch either.
It could probably be removed; the normal CPU buffer type calls ggml_aligned_malloc,
which already uses HBM. So at the moment this buffer type serves no purpose.
ggml/src/ggml-cpu/ggml-cpu.c
Outdated
int64_t const matmul_num_cols = type_traits_cpu[type].ncols;
ggml_gemv_t const gemv = type_traits_cpu[type].gemv;
//int64_t const matmul_num_cols = type_traits_cpu[type].ncols;
//ggml_gemv_t const gemv = type_traits_cpu[type].gemv;
It looks to me like this was not written for dynamic repack, only for "native" Q4_0_N_M packing.
I left it commented out; it needs some work to be usable with dynamic repacking.
We should try to fix this to keep support for the aarch64 types in models with experts.
Force-pushed from 36a0406 to 655a3fb
Overall looks good. I am not sure about removing support for current Q4_0_x_x models, but I guess if we are going to do it, it is better to do it sooner than later.
Yes, it will be the main/difficult choice:
@slaren I still need your expertise so as not to make too many mistakes. I was looking for where llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c Line 7497 in 9336db4
It looks to me to be in this function: llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c Lines 13220 to 13223 in 9336db4
Am I right? If so, it looks to me like the size is not calculated correctly for llamafile and Q4_0 repacking:
llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c Lines 13277 to 13284 in 9336db4
Note: I'm trying to make this more generic to make it easier to reintegrate the AMX backend, so it may not be useful to fix it for now.
It's OK if we over-allocate a bit of memory for
Isn't
Yes, that is the case for Q4_0_M_N, so it is not critical for now, even if internally it is more of a Q8_0_N:
But it may not work for other/future cases.
If we remove the old API and make the CPU backend accessible only through ggml-backend, then there will be a context that can be used to store the work buffer. Then the work buffer could simply be a
So you confirm that for now this is where the size is calculated.
Yes, the size is calculated in the function
Force-pushed from 655a3fb to e772df4
Force-pushed from a411d95 to fd768e0
Force-pushed from fd768e0 to 16154eb
Can't find how to enable C++17 for macOS-latest-swift / xcode...
Force-pushed from 16154eb to dc8adeb
OK, now that #10570 is merged and C++17 is the default, I need to do some more work.
Force-pushed from dc8adeb to 1b29245
Force-pushed from 1b29245 to 733f891
@slaren @ggerganov what do you think of this refactor? I tried to make adding a "cpu-extra-buffer" simpler and more generic. 🤞
The design looks good, this is a good improvement. Just reformat the code according to the .clang-format
file and remove outdated comments.
- clean Q4_0_N_M and IQ4_0_N_M
  - remove from "file" tensor type
  - allow only with dynamic repack
- extract cpu extra bufts and convert to C++
  - hbm
  - "aarch64"
- more generic use of extra buffer
  - generalise extra_supports_op
- new API for "cpu-accel":
  - amx
  - aarch64
Enable restrict on C++
Nice job, this is a good improvement.
Force-pushed from 8e5bd04 to b14b471
I updated the size checks; it should be better like this. 🤞
We should add debug logs about which repacks are applied to which tensors, so that when running with
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Force-pushed from be3c64b to 1221d13
@ggerganov: I added 2 logs, is that what you were thinking?
Yes, perfect. Should we merge this, or is there anything else you are planning to do?
For me we can merge it (if the CI succeeds 🤞).
@slaren @ggerganov thanks for all your reviews and time.
Just saw that opened bug. Is the implication of this change that I should no longer be making Q4_0_N_M quants? They seem to be fully removed?
Yes, support for
* rename ggml-cpu-aarch64.c to .cpp
* reformat extra cpu backend.
  - clean Q4_0_N_M and IQ4_0_N_M
    - remove from "file" tensor type
    - allow only with dynamic repack
  - extract cpu extra bufts and convert to C++
    - hbm
    - "aarch64"
  - more generic use of extra buffer
  - generalise extra_supports_op
  - new API for "cpu-accel":
    - amx
    - aarch64
* clang-format
* Clean Q4_0_N_M ref
  Enable restrict on C++
* add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack
* added/corrected control on tensor size for Q4 repacking.
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* add debug logs on repacks.
---------
Co-authored-by: Georgi Gerganov <[email protected]>
goal: consolidation of the cpu backend for reintegration of the AMX backend.