i4TsU commented Nov 20, 2025

TL;DR: fixes a bug where n_ubatch < n_batch causes llama_decode to split batches internally during image/audio chunk processing, which breaks M-RoPE positional embeddings. This PR works around the issue by capping n_batch to n_ubatch before the batch splits are calculated, so the M-RoPE data stays intact.

Context:
I had been getting terrible results across multiple Qwen3-VL models, with many different combinations of llama-server and llama-mtmd-cli parameters, when trying to extract 2D bbox coordinates from single images. Most of the time the models would describe the general contents of the image okay, but then provide bboxes that weren't even close, despite the exact same models working fine via llama.cpp. I had even started trying to port across the ~20 multimodal/qwen3vl-related commits from mainline llama.cpp, but had a random thought to check this after the first few hours lol. Anyway, until I (or whoever) actually brings over the real fixes for the Qwen3-VL implementation, this PR at least helps anyone using the default ubatch size with image inputs - especially considering the Qwen team have said that grounding tasks suffer with fewer than ~1024 image tokens.


I tested it pretty much all morning with many different combinations of settings, but the screenshots below hopefully show the main idea - all generated with the same command save for -ub (e.g. & .\llama-server.exe -m "X:\Models\unsloth\Qwen3-VL-8B-Instruct\Qwen3-VL-8B-Instruct-Q8_0.gguf" --mmproj "X:\Models\unsloth\Qwen3-VL-8B-Instruct\mmproj-F32.gguf" -ngl 999 -t 10 -tb 12 -c 8192 --jinja --temp 0 --top-k 1 --presence-penalty 1.5 --samplers 'top_k;temperature' --mlock -b 2048 -ub 512):

Main branch:

- -b 2048 -ub 512 (default): [screenshot: ub512main]
- -b 2048 -ub 2048: [screenshot: ub2048main]

PR:

- -b 2048 -ub 512 (default): [screenshot: ub512pr]
- -b 2048 -ub 2048 (exact same output as main branch): [screenshot: ub2048pr]

llama.cpp main branch (for reference): [screenshot: llamacpp]

Ensure n_batch does not exceed n_ubatch to prevent llama_decode from
splitting the batch, which would break M-RoPE positional embeddings
during image chunk decoding.
firecoperana (Collaborator) commented Nov 20, 2025

Is it the same issue ggml-org/llama.cpp#13694?

i4TsU (Author) commented Nov 20, 2025

> Is it the same issue ggml-org/llama.cpp#13694?

Yeah, same underlying issue with the extra dimensions from M-RoPE. This PR gets qwen3vl models working, but I'm not sure if/what it might break for other models using M-RoPE. I'm yet to finish going through the exact timeline of commits from the squashed d261223 merge on mainline that added initial qwen3vl support, but I would honestly be pretty out of my depth on most of it anyway, so if you/anyone wants to point me in the right direction, I'll gladly accept the guidance :) There were 3-4+ different directions they tried for the M-RoPE implementation, partly so that support for other mtmd models didn't break (I think they ended up storing positions in KV cells for proper 2D causal masking). But I have no idea how much harder/easier it would be to implement in ik_llama.cpp.

ikawrakow (Owner)

Wouldn't it be better to put the n_batch change in mtmd-cli, and issue a warning that this is being done?

Using n_batch = n_ubatch seems to improve things, but it does not look like it is working very well (based on the screenshots).

I guess we can merge this as a temporary workaround, but then one needs to pick up all other changes that have been added in mainline.

firecoperana (Collaborator)

I have a branch that merges most of these changes from mainline, but it lacks one commit that I don't know how to port. I can push it as a draft.

ikawrakow (Owner)

> I have a branch that merges most changes of this from mainline, but it lacks one commit that I don't know how to port

Which is the missing commit and what is the problem porting it?

firecoperana (Collaborator)

ggml-org/llama.cpp#16825. I'm not familiar with the KV cache and there were a lot of refactors done to it. Also, flash attention in clip is not working, but I just disable it for now.

firecoperana (Collaborator)

It has #988, which improves M-RoPE too.

i4TsU (Author) commented Nov 21, 2025

Yeah, this is definitely only a slight improvement, and we'd need to figure out how best to translate the other mainline changes to the ik_llama.cpp implementation - assuming there is enough interest in supporting the Qwen3 vision models here. I will share the rough list of relevant llama.cpp commits I had identified so far in #993, in case that's of any help - I just wasn't sure how much time/interest you guys had spare to work on it, and I'm not confident enough to do it myself :)

i4TsU added a commit to i4TsU/ik_llama.cpp that referenced this pull request Nov 25, 2025