
Feature Request: Support for Qwen2-VL #9246

Open
isr431 opened this issue Aug 29, 2024 · 123 comments

Labels
enhancement New feature or request

Comments

@isr431
isr431 commented Aug 29, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Qwen just released Qwen2-VL 2B & 7B under the Apache 2.0 License.

Motivation

  • SoTA understanding of images of various resolutions & aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
  • Understanding of videos of 20 min+: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, content creation, etc.

Possible Implementation

No response

isr431 added the enhancement (New feature or request) label on Aug 29, 2024
@chigkim

chigkim commented Aug 31, 2024

+1 This would be another great addition!

@crzroot

crzroot commented Aug 31, 2024

This model is awesome

@suepradun

I am looking forward to it very much

@xzlinux

xzlinux commented Aug 31, 2024

+1 I am looking forward to it very much

@yukiarimo

We can try llamafying it.

@XDesktopSoft

+1

7 similar comments
@WildCatApp

+1

@uestcbraid

+1

@mrhalyang

+1

@elyzionz

elyzionz commented Sep 2, 2024

+1

@eaucoin

eaucoin commented Sep 2, 2024

+1

@Kimizhao

Kimizhao commented Sep 3, 2024

+1

@enryteam

enryteam commented Sep 4, 2024

+1

@yukiarimo

Any updates?

@apipino

apipino commented Sep 5, 2024

+1

5 similar comments
@Xhehab

Xhehab commented Sep 5, 2024

+1

@Seaman3body

+1

@zenoverflow

+1

@whoisltd

whoisltd commented Sep 6, 2024

+1

@eav-solution

+1

@feynmanloo

I cannot wait for it!

@chigkim

chigkim commented Sep 8, 2024

Maybe people should also express interest and ask the Qwen2-VL devs to implement it:
QwenLM/Qwen2-VL#7

@wmx-github

Looking forward to using llama.cpp for on-device inference.

@HimariO
Contributor

HimariO commented Sep 11, 2024

Is anyone already working on this? If not, I would like to give it a try.

@solangii

+1
Are there any updates?

@PredyDaddy

+1

2 similar comments
@shobhit9618

+1

@zhouxihong1

+1

@sakulall

sakulall commented Dec 5, 2024

[image attached] Today's garbled output is not the same as yesterday's, but it is still garbled.

Why does it say "minicpmv_init" in your command? Have you correctly compiled the qwen2vl-cli main program? I suspect you are trying to run the model with the minicpmv main program.

@auriocus I saw the minicpmv_init you mentioned; maybe you're right. I didn't find the right qwen2vl project to compile. I'll keep testing, thanks.

Have you switched to the qwen2-vl branch of the repo?

@huucuong1503 Thanks to your help, I got the model running with no garbled characters. I recompiled the branch following the build instructions in your Kaggle project. The result is a success!

@sakulall

sakulall commented Dec 5, 2024

@sakulall

Why does it say "minicpmv_init" in your command? Have you correctly compiled the qwen2vl-cli main program? I suspect you are trying to run the model with the minicpmv main program.

@auriocus I saw the minicpmv_init you mentioned; maybe you're right. I didn't find the right qwen2vl project to compile. I'll keep testing, thanks.

Try this patch for the Makefile:

diff --git a/Makefile b/Makefile
index 8a903d7e..51403be2 100644
--- a/Makefile
+++ b/Makefile
@@ -1485,6 +1485,14 @@ libllava.a: examples/llava/llava.cpp \
        $(OBJ_ALL)
        $(CXX) $(CXXFLAGS) -static -fPIC -c $< -o $@ -Wno-cast-qual
 
+llama-qwen2vl-cli: examples/llava/qwen2vl-cli.cpp \
+       examples/llava/llava.cpp \
+       examples/llava/llava.h \
+       examples/llava/clip.cpp \
+       examples/llava/clip.h \
+       $(OBJ_ALL)
+       $(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual
+
 llama-llava-cli: examples/llava/llava-cli.cpp \
        examples/llava/llava.cpp \
        examples/llava/llava.h \

and then do: make llama-qwen2vl-cli

Thank you for your help as well.
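(If you build with CMake instead of the Makefile, a standard build should produce the same binary under build/bin — a minimal sketch, where model.gguf, mmproj.gguf, and img.png are placeholder file names:)

cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m model.gguf --mmproj mmproj.gguf --image img.png -p "Describe the image."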

@sakulall

sakulall commented Dec 5, 2024

But I also found a problem: I'm on two 4090s and I want to try the 7B model, but I get a CUDA out of memory error. Even when the model is quantized it exceeds the VRAM. Is there any good way to handle this?

@huucuong1503

But I also found a problem: I'm on two 4090s and I want to try the 7B model, but I get a CUDA out of memory error. Even when the model is quantized it exceeds the VRAM. Is there any good way to handle this?

Try decreasing -ngl.
Actually, I just ran it with -ngl 33 and n_ctx = 23000 on a Jetson Orin NX and it works quite well with 9.4 GB of VRAM.
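For illustration, an invocation along those lines might look like this (the model, mmproj, and image file names are placeholders; tune -ngl and -c to your VRAM):

llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_k_m.gguf --mmproj qwen2-vl-7b-mmproj-f16.gguf --image test.jpg -ngl 33 -c 23000 -p "Describe the image."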

@sakulall

sakulall commented Dec 5, 2024

@huucuong1503 Thanks. I'm currently working on the deployment, but I don't know how to apply this project to my field of work. Like you said about the Jetson Orin NX, I have only just got this project running; I want to add features to it. What should I do? Can you talk about it briefly?

@huucuong1503

huucuong1503 commented Dec 6, 2024

@huucuong1503 Thanks. I'm currently working on the deployment, but I don't know how to apply this project to my field of work. Like you said about the Jetson Orin NX, I have only just got this project running; I want to add features to it. What should I do? Can you talk about it briefly?

Can you tell me what project you are working on? In my case, I'm building a local VLM agent that can run on a Jetson to control UAVs, and I'm working with the Qwen-Agent framework for this task. You can edit and modify the code in qwen2vl-cli.cpp and refer to the Qwen2-VL paper for getting the right tokens.

@sakulall

sakulall commented Dec 6, 2024

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

@PredyDaddy

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

This demand sounds like you are at a Chinese company? I also use Qwen2-VL to do fire detection lol

@huucuong1503

Oh, this task seems quite simple; you just need to adjust and add some system prompts for a JSON structure. But I think using a VLM for fire detection is a bit overkill for this task. You could use OWL-ViT, which is really good at detecting open-vocabulary classes. But of course you can contact me by LinkedIn or Gmail for further discussion:
[email protected]
https://www.linkedin.com/in/huucuonghcmute/
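For illustration only, a prompt along those lines might look like the following (the wording and JSON fields are made up for the example, and the file names are placeholders; the prompt is passed to llama-qwen2vl-cli via -p):

llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_k_m.gguf --mmproj mmproj.gguf --image frame.jpg -p "Look at the image and answer only with JSON of the form {\"fire\": true or false, \"description\": \"...\"}. Report fire: true only if flames or heavy smoke are visible."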

@sakulall

sakulall commented Dec 6, 2024

@huucuong1503 My current work is about scene monitoring on embedded devices: for example, continuously watching a scene for fire, using Qwen for scene understanding, and raising an alarm when a fire occurs. So I just need to modify the code in qwen2vl-cli.cpp to do that?

This demand sounds like you are at a Chinese company? I also use Qwen2-VL to do fire detection lol

Yes, our company's new business is moving in the direction of large models.

@sakulall

sakulall commented Dec 6, 2024

Oh, this task seems quite simple; you just need to adjust and add some system prompts for a JSON structure. But I think using a VLM for fire detection is a bit overkill for this task. You could use OWL-ViT, which is really good at detecting open-vocabulary classes. But of course you can contact me by LinkedIn or Gmail for further discussion: [email protected] https://www.linkedin.com/in/huucuonghcmute/

Thanks again, I will be in touch with you

@gitl33

gitl33 commented Dec 7, 2024

Hi all,

I currently build with: cmake . -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc) -DTCNN_CUDA_ARCHITECTURES=61

How do I build with -DGGML_SYCL=ON to get a build package like this one:

llama-b4218-bin-win-sycl-x64.zip

I'd really appreciate any help, thanks guys!
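Not an authoritative answer, but based on llama.cpp's SYCL build documentation the Linux build is roughly as follows (a sketch that assumes the Intel oneAPI toolkit is installed and sourced; check the SYCL docs in the repo for the exact, current steps and for how the Windows release zips are packaged):

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j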

@brianestadimas

@huucuong1503 if you have any spare time, answer me, thank you!!

Hey, I have uploaded my model; you can check it at https://www.kaggle.com/models/cngnguyntrnhu/qwen2vl_gguf_quantize4_k_m

Thank you so much!

@beginor

beginor commented Dec 15, 2024

I have tried llama-qwen2vl-cli with qwen2-vl-72b-instruct-q4_k_m.gguf and qwen2-vl-72b-instruct.f32.mmproj.gguf with this command (M1 Max, 64 GB):

llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj ~/Downloads/qwen2-vl-72b-instruct.f32.mmproj.gguf --image demos/images/03.jpg

Got an error:

ggml_metal_encode_node: error: unsupported op 'IM2COL'
~/Developer/llama.cpp/ggml/src/ggml-metal/ggml-metal.m:1263: unsupported op

The full output is:

build: 4329 (89d604f2) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.2.0
llama_load_model_from_file: using device Metal (Apple M1 Max) - 57343 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 963 tensors from /Users/zhang/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 72B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 72B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = tongyi-qianwen
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2-VL-...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2 VL 72B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-72B
llama_model_loader: - kv  13:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                        qwen2vl.block_count u32              = 80
llama_model_loader: - kv  16:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  17:                   qwen2vl.embedding_length u32              = 8192
llama_model_loader: - kv  18:                qwen2vl.feed_forward_length u32              = 29568
llama_model_loader: - kv  19:               qwen2vl.attention.head_count u32              = 64
llama_model_loader: - kv  20:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 15
llama_model_loader: - kv  24:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-72B-Instruct-GGU...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_0:   40 tensors
llama_model_loader: - type q8_0:   40 tensors
llama_model_loader: - type q4_K:  401 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2vl
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 29568
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 8
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 72.71 B
llm_load_print_meta: model size       = 44.15 GiB (5.22 BPW) 
llm_load_print_meta: general.name     = Qwen2 VL 72B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256

llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: Metal_Mapped model buffer size = 45213.45 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   668.25 MiB
..................................................................................................
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 60129.54 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 60129.54 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_kv_cache_init:      Metal KV buffer size =  1280.00 MiB
llama_new_context_with_model: KV self size  = 1280.00 MiB, K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:      Metal compute buffer size =   570.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2806
llama_new_context_with_model: graph splits = 322
ggml_metal_encode_node: error: unsupported op 'IM2COL'
~/Developer/llama.cpp/ggml/src/ggml-metal/ggml-metal.m:1263: unsupported op
zsh: abort      llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj 

@Vi-cs

Vi-cs commented Dec 18, 2024

(Quoting beginor's report above, which ends with: ggml_metal_encode_node: error: unsupported op 'IM2COL')

Same issue on an M4 Max with 128 GB.

@chigkim

chigkim commented Dec 19, 2024

Same on M3-Max 64GB
error: unsupported op 'IM2COL'

@bunnyfu

bunnyfu commented Dec 19, 2024

Same error on MBP M3-Max 128GB

@chigkim

chigkim commented Dec 19, 2024

Those of you who have the problem on a Mac: please share your setup, Mac specs, and which models and quants (both LLM and mmproj) you tried, here:
#10361

@bunnyfu, @Vi-cs, @beginor

@ggerganov
Owner

Mac issues should be fixed with #10896

@remixer-dec

remixer-dec commented Dec 19, 2024

I'm getting

2.04.266.280 I encode_image_with_clip: load_image_size 640 512
2.04.266.281 I encode_image_with_clip: image embedding created: 437 tokens
2.04.266.282 I encode_image_with_clip: image encoded in   121.50 ms by CLIP (    0.28 ms per image patch)
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_range_insert

when running images with --parallel >= 3 on CUDA via the llama-box server; possibly an issue on their side, since the llama.cpp server does not support mmproj.

Update: setting a bigger context length seems to help.

@chigkim

chigkim commented Dec 19, 2024

Thanks! It now works on my m3-max with #10896.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/10896/head:pr10896
git checkout pr10896
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m xxx.gguf --mmproj yyyy.gguf --image img.png -p "Describe the image."

@beginor

beginor commented Dec 20, 2024

I have tried qwen2-vl-72b-instruct.q4_k_m.gguf and qwen2-vl-72b-instruct.f32.mmproj.gguf with llama.cpp build 4367 on an M1 Max. It works with PNG and JPEG, but it does not work with WebP images; the error is:

clip_image_load_from_bytes: failed to decode image bytes
llava_image_embed_make_with_bytes: can't load image from bytes, is it a valid image?load_image: is /Users/zhang/Downloads/xuguimei.webp really an image file?
main: failed to load image ~/Downloads/test.webp. Terminating

@chigkim

chigkim commented Dec 20, 2024

I don't think it supports WebP. Just convert to PNG or JPEG for now.
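For example, with ImageMagick (assuming it is installed with WebP support; any image converter will do):

magick input.webp input.png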

@gaussiangit

How do I merge the two GGUFs for Ollama? Can the LLM GGUF and the vision encoder GGUF be merged?

@chigkim

chigkim commented Dec 23, 2024

@gaussiangit Ollama doesn't support qwen2-vl yet.
Feature request for Ollama here:
ollama/ollama#6564

@Cheesper

Any updates?

@embedsri

I was able to successfully test llama-qwen2vl-cli describing an image with the Qwen2-VL-7B model on Android (a Samsung S21+, to be specific). The operation takes a reasonable 3-4 minutes with quantization. I'll be looking into Metal or Vulkan to further improve performance by using the phone's GPU, and to repeat this on iOS as well.

@vojtapolasek

Hello @embedsri, could you please share more details on how you did that?
I downloaded the official model from Hugging Face, used convert_hf_to_gguf.py to convert the text part, and then the qwen2vl surgery script to extract the mmproj from the original model.
I am running it on Fedora with the latest llama.cpp (c3f9d25) on CPU only (AMD Ryzen), 64 GiB of RAM.
I observe two strange things:

  1. The image is encoded very slowly, it takes like 4 minutes.
  2. I don't get any meaningful output.

See this:

encode_image_with_clip: step 1 of 1 encoded in 234926.53 ms
encode_image_with_clip: all 1 segments encoded in 234926.55 ms
encode_image_with_clip: load_image_size 1536 2048
encode_image_with_clip: image embedding created: 4070 tokens

encode_image_with_clip: image encoded in 234939.22 ms by CLIP (   57.72 ms per image patch)
llama_decode: failed to decode, ret = 1
eval_tokens : failed to eval. token 0/12 (batch size 2048, n_past 4085)

I did not quantize the mmproj model, but I tried quantizing the text model to q4_0, no difference.

@auriocus

1. The image is encoded very slowly, it takes like 4 minutes.

Yes, this CLIP encoding is quite compute-intensive. Especially with the newest commits, where the GPU acceleration was deactivated (because it only ever worked on CUDA and everyone else started complaining), it takes some time. But I also think your image is quite large:

2. I don't get any meaningful output.

See this:

encode_image_with_clip: step 1 of 1 encoded in 234926.53 ms
encode_image_with_clip: all 1 segments encoded in 234926.55 ms
encode_image_with_clip: load_image_size 1536 2048
encode_image_with_clip: image embedding created: 4070 tokens

How did you set the context length? When the image already takes up 4070 tokens, maybe there is nothing left for the prompt and the result. I'd first try downscaling the image and see what happens.
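For illustration, one way to try both suggestions is to downscale the image first and pass a larger context (the resize step assumes ImageMagick; 8192 is just an example context size, and the model/mmproj/image file names are placeholders):

magick photo.jpg -resize "1280x1280>" photo_small.jpg
llama-qwen2vl-cli -m qwen2-vl-7b-instruct-q4_0.gguf --mmproj mmproj-f16.gguf --image photo_small.jpg -c 8192 -p "Describe the image."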
