
Siglip2VisionModel weights mismatch #36399

Closed
deanflaviano opened this issue Feb 25, 2025 · 3 comments

Comments

deanflaviano commented Feb 25, 2025

System Info

RuntimeError: Error(s) in loading state_dict for Siglip2VisionModel:
	size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([196, 768]) from checkpoint, the shape in current model is torch.Size([256, 768]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

How to reproduce:

pip install git+https://github.com/huggingface/[email protected]

from PIL import Image
import requests
from transformers import AutoProcessor, Siglip2VisionModel

model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled features

The same error occurs with google/siglip-so400m-patch14-384.

Who can help?


Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Same script as under "How to reproduce" above.

Expected behavior

The weights load without a size mismatch.

deanflaviano (Author) commented:
For context, the following works, so the error seems specific to Siglip2VisionModel:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**inputs)

qubvel (Member) commented Feb 25, 2025

Hey @deanflaviano, it might be a bit confusing, but:

  • fixed-resolution SigLIP2 checkpoints actually use the SiglipModel (v1) architecture unchanged
  • flexible-resolution SigLIP2 is Siglip2Model (v2), and those checkpoints are marked with a -naflex suffix

In your particular case, you can load the weights with SiglipVisionModel.

@qubvel qubvel added the Vision label Feb 25, 2025
deanflaviano (Author) commented:

Alright, thanks.

By the way, it looks like the documentation uses:

model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
