Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect docstring of get_anyres_image_grid_shape #31588

Open
DarkLight1337 opened this issue Jun 25, 2024 · 4 comments
Open

Incorrect docstring of get_anyres_image_grid_shape #31588

DarkLight1337 opened this issue Jun 25, 2024 · 4 comments

Comments

@DarkLight1337
Copy link

Upon inspecting the source code, the image_size tuple should be in the form (height, width) instead of (width, height)

The size of the input image in the format (width, height).

@amyeroberts
Copy link
Collaborator

@DarkLight1337 Would you like to open a PR to fix this?

cc @zucchini-nlp To confirm, as I think this was raised elsewhere and there's a double inversion which happens (?)

@DarkLight1337
Copy link
Author

After looking at the code a bit more, now I am more confused. It seems that LLaVA-NeXT model treats it as (width, height) but still works correctly. Or is that just incorrect variable naming?

@zucchini-nlp
Copy link
Member

Hey!

Yes, this issue has been noticed by several people and I can confirm that our implementation matched perfectly with the LLaVa-NeXT. Yes, there are naming discrepancies between the two, which is confusing but it all comes from the way it's done in the original repo.

But if we try to get the correct way, the way is should be as I understand, then there is a "bug" in both implementations. Because LLaVa-NeXT treat is as (width, height) up to some point in modeling where the order is swapped back to (height, width) (they permute image to "height, width" and not "width, height").

I raised a question to LLaVa authors a week ago and didn't get a reply yet. So I wouldn't change anything in transformers until authors confirm it's a bug and not an intended thing. Only thing I can do is add a small comment in code clarifying the point. I could align naming with LLaVa-NeXT repo by using (width, height) order in processing, but that would raise more questions about why we use incorrect order while image-processing

Hope this clarifies it a bit ;)

@DarkLight1337
Copy link
Author

Thanks for the clarification! Let's wait until the authors respond then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants