Qwen3-VL minor fix/stopgap for M-RoPE implementation #988
base: main
Ensure `n_batch` does not exceed `n_ubatch` to prevent `llama_decode` from splitting the batch, which would break M-RoPE positional embeddings during image chunk decoding.
Is it the same issue as ggml-org/llama.cpp#13694?
Yeah, same underlying issue with the extra dimensions from M-RoPE. This PR gets Qwen3-VL models working, but I'm not sure if/what it might break for other models using M-RoPE. I have yet to finish going through the exact timeline of commits from the squashed d261223 merge on mainline that added initial Qwen3-VL support, but I would honestly be pretty out of my depth on most of it anyway, so if you/anyone wants to point me in the right direction, I'll gladly accept the guidance :) There were 3-4+ different directions they tried for the M-RoPE implementation, partly so that support for other mtmd models didn't break (I think they ended up storing positions in KV cells for proper 2D causal masking), but I have no idea how much harder/easier that would be to implement in ik_llama.cpp.
Wouldn't it be better to put the … Using … I guess we can merge this as a temporary workaround, but then one needs to pick up all the other changes that have been added in mainline.
I have a branch that merges most of these changes from mainline, but it lacks one commit that I don't know how to port. I can push it as a draft.
Which is the missing commit and what is the problem porting it? |
ggml-org/llama.cpp#16825. I'm not familiar with the KV cache, and there were a lot of refactors done to it. Also, flash attention in clip is not working, but I just disable it for now.
It has #988, which improves M-RoPE too.
Yeah, this is definitely only a slight improvement, and someone would need to figure out how best to translate the other mainline changes to the ik_llama.cpp implementation, assuming there is enough interest in supporting the Qwen3 vision models here. I will share the rough list of relevant llama.cpp commits I have identified so far in #993, in case that's of any help. I just wasn't sure how much time/interest you guys had spare to work on it, and I'm not confident enough to do it myself :)
**TL;DR:** fixes a bug where `n_ubatch < n_batch` causes `llama_decode` to internally split batches during image/audio chunk processing, which breaks M-RoPE positional embeddings. This PR "fixes" the issue by capping `n_batch` to `n_ubatch` before calculating batch splits, so that the M-RoPE data stays intact.

**Context:**
I have been getting terrible results across multiple Qwen3-VL models, with many different variations of `llama-server` and `llama-mtmd-cli` parameters, when trying to extract 2D bbox coords from single images. Most of the time, models would seem to describe the general contents of the image okay, but then just provide bboxes that weren't even close, despite the exact same models working fine via llama.cpp. I had even started working on trying to port across the ~20 multimodal/qwen3vl-related commits from mainline llama.cpp, but had a random thought to check this after the first few hours lol. Anyway, until I, or whoever, actually brings over the real fixes for the Qwen3-VL implementation, this PR at least helps anyone using the default ubatch size with image inputs, especially considering the Qwen team have said that grounding tasks suffer with fewer than ~1024 image tokens.

I tested it pretty much all morning, with all different combinations of settings, but these should show the main idea. All were generated with the same command save for `-ub` (e.g. `& .\llama-server.exe -m "X:\Models\unsloth\Qwen3-VL-8B-Instruct\Qwen3-VL-8B-Instruct-Q8_0.gguf" --mmproj "X:\Models\unsloth\Qwen3-VL-8B-Instruct\mmproj-F32.gguf" -ngl 999 -t 10 -tb 12 -c 8192 --jinja --temp 0 --top-k 1 --presence-penalty 1.5 --samplers 'top_k;temperature' --mlock -b 2048 -ub 512`):

**Main branch:**
- `-b 2048 -ub 512` (default)
- `-b 2048 -ub 2048`

**PR:**
- `-b 2048 -ub 512` (default)
- `-b 2048 -ub 2048` (exact same output as main branch)

**llama.cpp main branch (for reference)**