mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16574
base: master
> add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;

I don't see why we need to add this new CLI. mtmd-cli can already do this with the -p and --image params.
        
          
tools/mtmd/CMakeLists.txt (Outdated)

    # JinaCLIP CLI (align style with other targets above)
    set(TARGET llama-jinaclip-cli)
    add_executable (${TARGET} jinaclip-cli.cpp)
    target_link_libraries (${TARGET} PRIVATE common mtmd Threads::Threads)
we should try to merge this with mtmd-cli to avoid the "fragmentation" trap of the old llava-cli binary
Agreed: I will merge this into llama-mtmd-cli to avoid adding another standalone CLI.
However, the Jina text and vision embeddings can each be produced with a separate model, which differs from the existing mtmd-cli workflow, where the text and vision models are needed at the same time.
I will add a Jina-specific path in mtmd-cli that supports running with only --mmproj + --image.
        
          
convert_hf_to_gguf.py (Outdated)

    self.gguf_writer.add_uint32("clip.vision.image_size", img_sz)
    self.gguf_writer.add_uint32("clip.vision.patch_size", patch_sz)
    self.gguf_writer.add_uint32("clip.vision.embedding_length", n_embd)
    self.gguf_writer.add_uint32("clip.vision.block_count", n_layer)
    self.gguf_writer.add_uint32("clip.vision.projection_dim", proj_dim)
    self.gguf_writer.add_uint32("clip.vision.feed_forward_length", n_ff)
    self.gguf_writer.add_uint32("clip.vision.attention.head_count", n_head)
We already have specific functions and constants for adding these metadata keys. Use them instead.
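The pattern being asked for is one named helper per metadata key, so the key strings live in a single place instead of being spelled out at every call site. A minimal sketch of that pattern follows. `RecordingWriter` is a hypothetical stand-in that only records key/value pairs, `write_vision_hparams` is an illustrative name, and the hyperparameter values are placeholders rather than JinaCLIP's real ones. As far as I can tell, gguf-py's `GGUFWriter` offers similarly named helpers (for example `add_vision_image_size`), but check `gguf_writer.py` before relying on any particular name.

```python
# Key strings defined once, mirroring the keys written by hand in the diff.
KEY_IMAGE_SIZE = "clip.vision.image_size"
KEY_PATCH_SIZE = "clip.vision.patch_size"
KEY_EMBEDDING_LENGTH = "clip.vision.embedding_length"
KEY_BLOCK_COUNT = "clip.vision.block_count"
KEY_PROJECTION_DIM = "clip.vision.projection_dim"
KEY_FEED_FORWARD_LENGTH = "clip.vision.feed_forward_length"
KEY_HEAD_COUNT = "clip.vision.attention.head_count"


class RecordingWriter:
    """Hypothetical stand-in for a GGUF writer: it only records key/value pairs."""

    def __init__(self):
        self.kv = {}

    def add_uint32(self, key, value):
        self.kv[key] = int(value)

    # One named helper per key, so call sites never spell out a key string.
    def add_vision_image_size(self, value):
        self.add_uint32(KEY_IMAGE_SIZE, value)

    def add_vision_patch_size(self, value):
        self.add_uint32(KEY_PATCH_SIZE, value)

    def add_vision_embedding_length(self, value):
        self.add_uint32(KEY_EMBEDDING_LENGTH, value)

    def add_vision_block_count(self, value):
        self.add_uint32(KEY_BLOCK_COUNT, value)

    def add_vision_projection_dim(self, value):
        self.add_uint32(KEY_PROJECTION_DIM, value)

    def add_vision_feed_forward_length(self, value):
        self.add_uint32(KEY_FEED_FORWARD_LENGTH, value)

    def add_vision_head_count(self, value):
        self.add_uint32(KEY_HEAD_COUNT, value)


def write_vision_hparams(writer, hp):
    """Write the vision hyperparameters through the named helpers only."""
    writer.add_vision_image_size(hp["image_size"])
    writer.add_vision_patch_size(hp["patch_size"])
    writer.add_vision_embedding_length(hp["n_embd"])
    writer.add_vision_block_count(hp["n_layer"])
    writer.add_vision_projection_dim(hp["proj_dim"])
    writer.add_vision_feed_forward_length(hp["n_ff"])
    writer.add_vision_head_count(hp["n_head"])


# Placeholder values for illustration only.
placeholder_hparams = {
    "image_size": 224, "patch_size": 14, "n_embd": 1024, "n_layer": 24,
    "proj_dim": 1024, "n_ff": 4096, "n_head": 16,
}
writer = RecordingWriter()
write_vision_hparams(writer, placeholder_hparams)
for key in sorted(writer.kv):
    print(key, "=", writer.kv[key])
```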
        
          
convert_hf_to_gguf.py (Outdated)

    # Top-level direct mappings
    if src_no_vm == 'cls_token':
        return [('v.cls_token', data_torch)]
Use proper mapping instead
        
          
tools/mtmd/clip.cpp (Outdated)

    if (!ctx->jinaclip_rope_initialized) {
        const int half_dim = rope_dim / 2;
        std::vector<float> base_freqs(half_dim);
        for (int i = 0; i < half_dim; i++) {
            float arange_val = i * 2.0f; // [0, 2, 4, ..., 30]
            float normalized = arange_val / rope_dim; // [0, 2/32, 4/32, ..., 30/32]
            float theta_powered = powf(freq_base, normalized); // theta^normalized
            base_freqs[i] = 1.0f / theta_powered; // 1.0 / theta^normalized
        }
Not sure what you're trying to do here. Is this just 2D RoPE (which we already support)?
This isn’t re‑implementing generic 2D RoPE; it implements JinaCLIP’s VisionRotaryEmbeddingFast.
It uses fractional‑position 2D RoPE (t = arange(ft)/ft * pt) and precomputes a full H×W cos/sin grid; the official 2D RoPE uses integer grid positions (pos_h/pos_w) with ggml_rope_ext and does not include these steps.
This is done to strictly match Jina’s Python semantics.
> fractional‑position 2D RoPE (t = arange(ft)/ft * pt)

Based on your code:

    time_seq[i] = (float) i / ft_seq_len * pt_seq_len;  // [0, 16/36, 32/36, ..., 560/36]
    ...
    freqs_h[t * half_dim + f] = time_seq[t] * base_freqs[f];

Then why don't we scale base_freqs[f] instead? The third param of ggml_rope_ext, the c tensor (freq_scale) is made for this purpose.
Honestly I think this is just YaRN
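The equivalence behind this question can be checked numerically. The sketch below recomputes the base frequencies from the loop in the diff and compares two ways of getting the rotation angle: scaling the position (the `time_seq` approach) versus scaling each frequency once and using integer positions. The values `rope_dim = 32`, `ft_seq_len = 36`, and `pt_seq_len = 16` are read off the code comments, and `freq_base = 10000` is the default stated in the PR description; the function names are illustrative. It only checks the arithmetic and makes no claim about how `ggml_rope_ext` applies its `c` tensor internally.

```python
import math


def base_freqs(rope_dim, freq_base):
    """Inverse frequencies 1 / freq_base^(2i / rope_dim), as in the diff's loop."""
    half_dim = rope_dim // 2
    return [1.0 / math.pow(freq_base, (2.0 * i) / rope_dim) for i in range(half_dim)]


def angles_scaled_positions(ft, pt, freqs):
    """Fractional positions t = i / ft * pt multiplied by unscaled frequencies."""
    time_seq = [i / ft * pt for i in range(ft)]
    return [[t * f for f in freqs] for t in time_seq]


def angles_scaled_freqs(ft, pt, freqs):
    """Integer positions multiplied by frequencies pre-scaled by pt / ft."""
    scale = pt / ft
    scaled = [f * scale for f in freqs]
    return [[i * f for f in scaled] for i in range(ft)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two angle tables."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


freqs = base_freqs(rope_dim=32, freq_base=10000.0)
diff = max_abs_diff(
    angles_scaled_positions(36, 16, freqs),
    angles_scaled_freqs(36, 16, freqs),
)
print(f"max |difference| = {diff:.3e}")
```

The two tables agree up to floating-point rounding, since `(i / ft * pt) * f` and `i * (f * pt / ft)` are the same product regrouped.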
        
          
tools/mtmd/clip.cpp (Outdated)

    }

    clip_image_u8 resized_keep_ratio;
    image_manipulation::bicubic_pil_resize(*img, resized_keep_ratio, out_w, out_h);
Generally pre-processing doesn't need to be byte-exact. I would prefer keeping the old bicubic_resize to keep it simple.
    self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3

    # Jina v3 (RoPE) without LoRA should export as jina-bert-v3 to avoid expecting absolute position embeddings
    try:
        text_cfg = hparams.get("text_config", {}) if isinstance(hparams.get("text_config", {}), dict) else {}
        pe_type = (text_cfg.get("position_embedding_type") or hparams.get("position_embedding_type") or "").lower()
        rope_base = text_cfg.get("rotary_emb_base", hparams.get("rotary_emb_base"))
        name_path = (hparams.get("_name_or_path") or "").lower()
        is_v3 = (pe_type == "rotary" or rope_base is not None) and ("jina" in name_path and "v3" in name_path)
        if is_v3 and not self._lora_names:
            self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3
Please explain this. First off, it breaks jina-embeddings-v3 conversion. Secondly, jina-clip-v2 looks like it loads jina-embeddings-v3 and uses the retrieval.query LoRA/prompt, but load_trained_adapters being set to false suggests it's not applied?
https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38
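To make the question concrete, the condition from the diff can be restated as a standalone function and run against a few configs. This is only a sketch: `is_jina_v3` and `lora_names` are illustrative names, and the sample configs (including the `org/jina-example-v3` path) are made up for the demonstration rather than copied from the Hugging Face repos.

```python
def is_jina_v3(hparams, lora_names):
    """Restates the diff's check: RoPE-style config, 'jina' and 'v3' in the name, no LoRA."""
    text_cfg = hparams.get("text_config")
    if not isinstance(text_cfg, dict):
        text_cfg = {}
    pe_type = (text_cfg.get("position_embedding_type")
               or hparams.get("position_embedding_type") or "").lower()
    rope_base = text_cfg.get("rotary_emb_base", hparams.get("rotary_emb_base"))
    name_path = (hparams.get("_name_or_path") or "").lower()
    looks_rotary = pe_type == "rotary" or rope_base is not None
    named_v3 = "jina" in name_path and "v3" in name_path
    return looks_rotary and named_v3 and not lora_names


# Made-up configs for illustration only.
rotary_v3 = {"position_embedding_type": "rotary", "_name_or_path": "org/jina-example-v3"}
rotary_local = {"position_embedding_type": "rotary", "_name_or_path": "./checkpoint"}

samples = [
    ("rotary, name contains 'jina' and 'v3', no LoRA", rotary_v3, []),
    ("same config, with a LoRA adapter", rotary_v3, ["retrieval.query"]),
    ("rotary, local path without 'v3' in the name", rotary_local, []),
]
for label, hp, loras in samples:
    print(f"{label}: {is_jina_v3(hp, loras)}")
```

Note that the result depends on `_name_or_path`, so the same weights saved under a different directory name would take a different branch.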
…_rope_ext; clean dead code/logs
- mtmd-cli: remove file output option and writing; keep --embd-output-format for stdout; unify image load via mtmd-helper; style fixes
- clip.cpp: unify 2D RoPE via ggml_rope_ext using per-dim c tensors (c_first=1/s, c_second=1/(s*odd)); remove precompute path; drop unused bicubic helpers; silence unused param warnings
- mtmd-helper: add noctx bitmap loader used by projector-only path
- convert_hf_to_gguf.py: robust JINA_BERT_V3 detection ((is_v3) or LoRA); remove debug prints; keep upstream tokenizer sample intact
- cleanup: remove debug artifacts (status md, analyze scripts), align with PR ggml-org#16574 guidance
…fix converter keys, rope_ext + bicubic mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
fd37a5c to 9d02918
9d02918 to e19eb27
e19eb27 to 2d8885b
2d8885b to b9f78de
b9f78de to 2787888

Update Notes (2025-10-22)
- …block_count/projection_dim/feed_forward_length/attention.head_count.

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview
- …common_embd_normalize(..., 2).
- …llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks; depends only on common+mtmd+Threads, cross-platform buildable, no third-party deps.

Scope of changes
- …clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV.
- …clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
- …llama-jinaclip-cli target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.

Validation summary
- …ci/run.sh passes locally; no ggml op changes in this PR.
- …encode_ms and thread scaling; no regression observed. More data can be added if requested.

Performance (absolute metrics, CPU-only minimal samples)

GPU group (absolute metrics, minimal samples)

Reproduction (optional)

Minimal commands & data (CPU)
- jina-bert-v3.pooling_type = MEAN/CLS/LAST
- clip.projector_type = jinaclip, clip.vision.rope_theta = 10000 (default)
- CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0
- python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
- CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0
- python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off

Files in this PR
- convert_hf_to_gguf.py
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
- tools/mtmd/jinaclip-cli.cpp
- tools/mtmd/CMakeLists.txt
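
The overview mentions `common_embd_normalize(..., 2)` and saving embeddings for Python parity. Assuming the `2` selects Euclidean (L2) normalization, a parity check on the Python side could look like the sketch below. The function names and the sample vectors are illustrative, not taken from the PR's `debug.py`, and the pass/fail threshold is left to the caller.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity of two embeddings, computed on their L2-normalized forms."""
    if len(a) != len(b):
        raise ValueError("embedding lengths differ")
    na, nb = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(na, nb))


# Illustrative vectors standing in for a C++ embedding and a Python reference.
cpp_embd = [3.0, 4.0]
ref_embd = [6.0, 8.0]
print("normalized:", l2_normalize(cpp_embd))
print("cosine similarity:", cosine_similarity(cpp_embd, ref_embd))
```

Comparing by cosine similarity after normalization makes the check insensitive to a uniform scale difference between the two outputs.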