
Feature Request: Support for Qwen2-VL #9246

Open
isr431 opened this issue Aug 29, 2024 · 123 comments

Labels
enhancement New feature or request

Comments

@isr431
isr431 commented Aug 29, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Qwen just released Qwen2-VL 2B & 7B under the Apache 2.0 License.

Motivation

  • SoTA understanding of images of various resolutions & aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
  • Understanding of videos of 20 min+: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, content creation, etc.

Possible Implementation

No response

isr431 added the enhancement (New feature or request) label on Aug 29, 2024
@chigkim

chigkim commented Aug 31, 2024

+1 This would be another great addition!

@crzroot

crzroot commented Aug 31, 2024

This model is awesome

@suepradun

I am looking forward to it very much

@xzlinux

xzlinux commented Aug 31, 2024

+1 I am looking forward to it very much

@yukiarimo

We can try llamafying it.

@XDesktopSoft

+1

7 similar comments
@WildCatApp

+1

@uestcbraid

+1

@mrhalyang

+1

@elyzionz

elyzionz commented Sep 2, 2024

+1

@eaucoin

eaucoin commented Sep 2, 2024

+1

@Kimizhao

Kimizhao commented Sep 3, 2024

+1

@enryteam

enryteam commented Sep 4, 2024

+1

@yukiarimo

Any updates?

@apipino

apipino commented Sep 5, 2024

+1

5 similar comments
@Xhehab

Xhehab commented Sep 5, 2024

+1

@Seaman3body

+1

@zenoverflow

+1

@whoisltd

whoisltd commented Sep 6, 2024

+1

@eav-solution

+1

@feynmanloo

I cannot wait for it!

@chigkim

chigkim commented Sep 8, 2024

Maybe people should also express interest and ask the Qwen2-VL devs to implement it:
QwenLM/Qwen2-VL#7

@wmx-github

Looking forward to using llama.cpp for on-device inference.

@HimariO
Contributor

HimariO commented Sep 11, 2024

Is anyone already working on this? If not, I would like to give it a try.

@solangii

+1
Are there any updates?

@PredyDaddy

+1

2 similar comments
@shobhit9618

+1

@zhouxihong1

+1

@sakulall

sakulall commented Dec 5, 2024

[image attached] Today's garbled output is not the same as yesterday's, but it is still garbled.

Why does it say "minicpmv_init" in your command? Have you correctly compiled the qwen2vl-cli main program? I suspect you are trying to run the model with the minicpmv main program.

@auriocus I saw the minicpmv_init you mentioned; maybe you're right. I didn't find the right qwen2vl project to compile. I'll keep testing, thanks.

Have you switched to the qwen2-vl branch of the repo?

@huucuong1503 Thanks to your help, I got the model running with no garbled characters. I recompiled the branch following the build instructions in your Kaggle project. The result is a success!

@sakulall

sakulall commented Dec 5, 2024

@sakulall

Why does it say "minicpmv_init" in your command? Have you correctly compiled the qwen2vl-cli main program? I suspect you are trying to run the model with the minicpmv main program.

@auriocus I saw the minicpmv_init you mentioned; maybe you're right. I didn't find the right qwen2vl project to compile. I'll keep testing, thanks.

Try this patch for the Makefile:

diff --git a/Makefile b/Makefile
index 8a903d7e..51403be2 100644
--- a/Makefile
+++ b/Makefile
@@ -1485,6 +1485,14 @@ libllava.a: examples/llava/llava.cpp \
        $(OBJ_ALL)
        $(CXX) $(CXXFLAGS) -static -fPIC -c $< -o $@ -Wno-cast-qual
 
+llama-qwen2vl-cli: examples/llava/qwen2vl-cli.cpp \
+       examples/llava/llava.cpp \
+       examples/llava/llava.h \
+       examples/llava/clip.cpp \
+       examples/llava/clip.h \
+       $(OBJ_ALL)
+       $(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual
+
 llama-llava-cli: examples/llava/llava-cli.cpp \
        examples/llava/llava.cpp \
        examples/llava/llava.h \

and then do: make llama-qwen2vl-cli

Thank you for your help as well.
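(If you build with CMake instead of the Makefile, a standard build should produce the same binary under build/bin — a minimal sketch, where model.gguf, mmproj.gguf, and img.png are placeholder file names:)

cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m model.gguf --mmproj mmproj.gguf --image img.png -p "Describe the image."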

@sakulall

sakulall commented Dec 5, 2024

But I also found a problem: I'm on two 4090s and I want to try the 7B model, but I get a CUDA out of memory error. Even when the model is quantized it exceeds the VRAM. Is there any good way to handle this?

@huucuong1503

But I also found a problem: I'm on two 4090s and I want to try the 7B model, but I get a CUDA out of memory error. Even when the model is quantized it exceeds the VRAM. Is there any good way to handle this?

Try decreasing -ngl.
Actually, I just ran it with -ngl 33 and n_ctx = 23000 on a Jetson Orin NX and it works quite well with 9.4 GB of VRAM.
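For illustration, an invocation along those lines might look like this (the model, mmproj, and image file names are placeholders; tune -ngl and -c to your VRAM):

llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_k_m.gguf --mmproj qwen2-vl-7b-mmproj-f16.gguf --image test.jpg -ngl 33 -c 23000 -p "Describe the image."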

@sakulall

sakulall commented Dec 5, 2024

@huucuong1503 Thanks. I'm currently working on the deployment, but I don't know how to apply this project to my field of work. Like you said about the Jetson Orin NX, I have only just got this project running; I want to add features to it. What should I do? Can you talk about it briefly?

@huucuong1503

huucuong1503 commented Dec 6, 2024

@huucuong1503 Thanks. I'm currently working on the deployment, but I don't know how to apply this project to my field of work. Like you said about the Jetson Orin NX, I have only just got this project running; I want to add features to it. What should I do? Can you talk about it briefly?

Can you tell me what project you are working on? In my case, I'm building a local VLM agent that can run on a Jetson to control UAVs, and I'm working with the Qwen-Agent framework for this task. You can edit and modify the code in qwen2vl-cli.cpp and refer to the Qwen2-VL paper for getting the right tokens.

@sakulall

sakulall commented Dec 6, 2024

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

@PredyDaddy

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

This demand sounds like you are at a Chinese company? I also use Qwen2-VL to do fire detection lol

@huucuong1503

Oh, this task seems quite simple; you just need to adjust and add some system prompts for a JSON structure. But I think using a VLM for fire detection is a bit overkill for this task. You could use OWL-ViT, which is really good at detecting open-vocabulary classes. But of course you can contact me by LinkedIn or Gmail for further discussion:
[email protected]
https://www.linkedin.com/in/huucuonghcmute/
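For illustration only, a prompt along those lines might look like the following (the wording and JSON fields are made up for the example, and the file names are placeholders; the prompt is passed to llama-qwen2vl-cli via -p):

llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_k_m.gguf --mmproj mmproj.gguf --image frame.jpg -p "Look at the image and answer only with JSON of the form {\"fire\": true or false, \"description\": \"...\"}. Report fire: true only if flames or heavy smoke are visible."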

@sakulall

sakulall commented Dec 6, 2024

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

This demand sounds like you are at a Chinese company? I also use Qwen2-VL to do fire detection lol

Yes, our company's new business is moving in the direction of large models.

@sakulall

sakulall commented Dec 6, 2024

Oh, this task seems quite simple; you just need to adjust and add some system prompts for a JSON structure. But I think using a VLM for fire detection is a bit overkill for this task. You could use OWL-ViT, which is really good at detecting open-vocabulary classes. But of course you can contact me by LinkedIn or Gmail for further discussion: [email protected] https://www.linkedin.com/in/huucuonghcmute/

Thanks again, I will be in touch with you

@gitl33

gitl33 commented Dec 7, 2024

Hi all,

I currently build with: cmake . -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc) -DTCNN_CUDA_ARCHITECTURES=61

How do I build with -DGGML_SYCL=ON to get a build package like this one:

llama-b4218-bin-win-sycl-x64.zip

I'd really appreciate any help, thanks guys!
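Not an authoritative answer, but based on llama.cpp's SYCL build documentation the Linux build is roughly as follows (a sketch that assumes the Intel oneAPI toolkit is installed and sourced; check the SYCL docs in the repo for the exact, current steps and for how the Windows release zips are packaged):

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j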

@brianestadimas

@huucuong1503 if you have any spare time, answer me, thank you!!

Hey, I have uploaded my model; you can check it at https://www.kaggle.com/models/cngnguyntrnhu/qwen2vl_gguf_quantize4_k_m

Thank you so much!

@beginor

beginor commented Dec 15, 2024

I have tried llama-qwen2vl-cli with qwen2-vl-72b-instruct-q4_k_m.gguf and qwen2-vl-72b-instruct.f32.mmproj.gguf with this command (M1 Max, 64 GB):

llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj ~/Downloads/qwen2-vl-72b-instruct.f32.mmproj.gguf --image demos/images/03.jpg

Got an error:

ggml_metal_encode_node: error: unsupported op 'IM2COL'
~/Developer/llama.cpp/ggml/src/ggml-metal/ggml-metal.m:1263: unsupported op

The full output is:

build: 4329 (89d604f2) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
llama_load_model_from_file: using device Metal (Apple M1 Max) - 57343 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /Users/zhang/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 72B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 72B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = tongyi-qianwen
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2-VL-...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2 VL 72B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-72B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                        qwen2vl.block_count u32              = 80
llama_model_loader: - kv  16:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  17:                   qwen2vl.embedding_length u32              = 8192
llama_model_loader: - kv  18:                qwen2vl.feed_forward_length u32              = 29568
llama_model_loader: - kv  19:               qwen2vl.attention.head_count u32              = 64
llama_model_loader: - kv  20:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 15
llama_model_loader: - kv  24:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-72B-Instruct-GGU...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_0:   40 tensors
llama_model_loader: - type q8_0:   40 tensors
llama_model_loader: - type q4_K:  401 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2vl
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 29568
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 8
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 44.15 GiB (5.22 BPW) 
llm_load_print_meta: general.name     = Qwen2 VL 72B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256

llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 45213.45 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   668.25 MiB
..................................................................................................
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 60129.54 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 60129.54 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_kv_cache_init:      Metal KV buffer size =  1280.00 MiB
llama_new_context_with_model: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      Metal compute buffer size =   570.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2806
llama_new_context_with_model: graph splits = 322
ggml_metal_encode_node: error: unsupported op 'IM2COL'
~/Developer/llama.cpp/ggml/src/ggml-metal/ggml-metal.m:1263: unsupported op
zsh: abort      llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj 

@Vi-cs

Vi-cs commented Dec 18, 2024

(Quoting beginor's report above, which ends with: ggml_metal_encode_node: error: unsupported op 'IM2COL')

Same issue on an M4 Max with 128 GB.

@chigkim

chigkim commented Dec 19, 2024

Same on M3-Max 64GB
error: unsupported op 'IM2COL'

@bunnyfu

bunnyfu commented Dec 19, 2024

Same error on MBP M3-Max 128GB

@chigkim

chigkim commented Dec 19, 2024

Those of you who have the problem on a Mac: please share your setup, Mac specs, and which models and quants (both LLM and mmproj) you tried, here:
#10361

@bunnyfu, @Vi-cs, @beginor

@ggerganov
Owner

Mac issues should be fixed with #10896

@remixer-dec

remixer-dec commented Dec 19, 2024

I'm getting

2.04.266.280 I encode_image_with_clip: load_image_size 640 512
2.04.266.281 I encode_image_with_clip: image embedding created: 437 tokens
2.04.266.282 I encode_image_with_clip: image encoded in   121.50 ms by CLIP (    0.28 ms per image patch)
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_range_insert

when running images with --parallel >= 3 on CUDA via the llama-box server; possibly an issue on their side, since the llama.cpp server does not support mmproj.

Update: setting a bigger context length seems to help.

@chigkim

chigkim commented Dec 19, 2024

Thanks! It now works on my m3-max with #10896.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/10896/head:pr10896
git checkout pr10896
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m xxx.gguf --mmproj yyyy.gguf --image img.png -p "Describe the image."

@beginor

beginor commented Dec 20, 2024

I have tried qwen2-vl-72b-instruct.q4_k_m.gguf and qwen2-vl-72b-instruct.f32.mmproj.gguf with llama.cpp build 4367 on an M1 Max. It works with PNG and JPEG, but it does not work with WebP images; the error is:

clip_image_load_from_bytes: failed to decode image bytes
llava_image_embed_make_with_bytes: can't load image from bytes, is it a valid image?load_image: is /Users/zhang/Downloads/xuguimei.webp really an image file?
main: failed to load image ~/Downloads/test.webp. Terminating

@chigkim

chigkim commented Dec 20, 2024

I don't think it supports WebP. Just convert to PNG or JPEG for now.
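For example, with ImageMagick (assuming it is installed with WebP support; any image converter will do):

magick input.webp input.png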

@gaussiangit

How do I merge the two GGUFs for Ollama? Can the LLM GGUF and the vision encoder GGUF be merged?

@chigkim

chigkim commented Dec 23, 2024

@gaussiangit Ollama doesn't support qwen2-vl yet.
Feature request for Ollama here:
ollama/ollama#6564

@Cheesper

Any updates?

@embedsri

I was able to successfully test llama-qwen2vl-cli describing an image with the Qwen2-VL-7B model on Android (a Samsung S21+, to be specific). The operation takes a reasonable 3-4 minutes with quantization. I'll be looking into Metal or Vulkan to further improve performance by using the phone's GPU, and to repeat this on iOS as well.

@vojtapolasek

Hello @embedsri, could you please share more details on how you did that?
I downloaded the official model from Hugging Face, used convert_hf_to_gguf.py to convert the text part, and then the qwen2vl surgery script to extract the mmproj from the original model.
I am running it on Fedora with the latest llama.cpp (c3f9d25) on CPU only (AMD Ryzen), 64 GiB of RAM.
I observe two strange things:

  1. The image is encoded very slowly, it takes like 4 minutes.
  2. I don't get any meaningful output.

See this:

encode_image_with_clip: step 1 of 1 encoded in 234926.53 ms
encode_image_with_clip: all 1 segments encoded in 234926.55 ms
encode_image_with_clip: load_image_size 1536 2048
encode_image_with_clip: image embedding created: 4070 tokens

encode_image_with_clip: image encoded in 234939.22 ms by CLIP (   57.72 ms per image patch)
llama_decode: failed to decode, ret = 1
eval_tokens : failed to eval. token 0/12 (batch size 2048, n_past 4085)

I did not quantize the mmproj model, but I tried quantizing the text model to q4_0, no difference.

@auriocus

1. The image is encoded very slowly, it takes like 4 minutes.

Yes, this CLIP encoding is quite compute-intensive. Especially with the newest commits, where the GPU acceleration was deactivated (because it only ever worked on CUDA and everyone else started complaining), it takes some time. But I also think your image is quite large:

2. I don't get any meaningful output.

See this:

encode_image_with_clip: step 1 of 1 encoded in 234926.53 ms
encode_image_with_clip: all 1 segments encoded in 234926.55 ms
encode_image_with_clip: load_image_size 1536 2048
encode_image_with_clip: image embedding created: 4070 tokens

How did you set the context length? When the image already takes up 4070 tokens, maybe there is nothing left for the prompt and the result. I'd first try downscaling the image and see what happens.
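For illustration, one way to try both suggestions is to downscale the image first and pass a larger context (the resize step assumes ImageMagick; 8192 is just an example context size, and the model/mmproj/image file names are placeholders):

magick photo.jpg -resize "1280x1280>" photo_small.jpg
llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_0.gguf --mmproj mmproj-f16.gguf --image photo_small.jpg -c 8192 -p "Describe the image."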
