
Siglip2VisionModel weights mismatch #36399

Closed
deanflaviano opened this issue Feb 25, 2025 · 3 comments

Comments

deanflaviano commented Feb 25, 2025

System Info

RuntimeError: Error(s) in loading state_dict for Siglip2VisionModel:
	size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([768, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([196, 768]) from checkpoint, the shape in current model is torch.Size([256, 768]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

How to reproduce:

pip install git+https://github.com/huggingface/[email protected]

from PIL import Image
import requests
from transformers import AutoProcessor, Siglip2VisionModel

model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output  # pooled features

The same error occurs with google/siglip-so400m-patch14-384.

Who can help?


Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Same script as under "How to reproduce" above.

Expected behavior

The weights load without a size mismatch.

deanflaviano (Author) commented:
For context, the following works, so the error seems specific to Siglip2VisionModel:

from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**inputs)

qubvel (Member) commented Feb 25, 2025

Hey @deanflaviano, it might be a bit confusing, but:

  • fixed-resolution SigLIP2 checkpoints actually use the SiglipModel (v1) architecture unchanged
  • flexible-resolution SigLIP2 is Siglip2Model (v2), and those checkpoints are marked with a -naflex suffix

In your particular case, you can load the weights with SiglipVisionModel.

@qubvel qubvel added the Vision label Feb 25, 2025
deanflaviano (Author) commented:

Alright, thanks.

By the way, it looks like the documentation uses:

model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
