
Objects spatial understanding #166

Open
majidnasr opened this issue Dec 5, 2024 · 0 comments

Hi @vikhyat,

Thank you for the great work! It is inspiring. I tested almost all the available VLMs for my project, which needs a model that understands the spatial relationships of objects in a scene. My test sample is an image of a person on one side of a room pointing at an object on the other side, with several objects in between. A human can understand it easily, but no VLM could get it except moondream2! It is magical.

My question is: how does moondream2 understand spatial relationships so well? At first, I thought it was because of the region_model and FourierFeatures module, which would add numerical representations of the regions' spatial relationships to the model, but you mentioned in a response to another issue that these modules are not integrated into the current version. My next guess involved the Vision Transformer architecture, the position embeddings, and the SigLIP initialization. Could these factors be contributing to moondream2's impressive spatial reasoning capabilities?
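For context on my first guess: the "Fourier features" idea is to project low-dimensional coordinates (e.g. region box centers) through a fixed random frequency matrix and take sines and cosines, giving the model a high-frequency positional signal. This is just my rough sketch of the general technique, not moondream2's actual region_model code; the function and variable names here are hypothetical.

```python
import numpy as np

def fourier_features(coords, freqs):
    """Sketch of a Fourier-features position encoding (hypothetical helper,
    not moondream2's implementation).
    coords: (N, d) array of normalized coordinates in [0, 1].
    freqs:  (d, m) fixed random frequency matrix.
    Returns a (N, 2*m) array of [cos, sin] components.
    """
    proj = 2 * np.pi * coords @ freqs  # (N, m) projected phases
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

# Example: embed two 2-D region centers with 8 random frequencies.
rng = np.random.default_rng(0)
freqs = rng.normal(scale=10.0, size=(2, 8))
centers = np.array([[0.25, 0.75], [0.50, 0.50]])
emb = fourier_features(centers, freqs)
print(emb.shape)  # (2, 16)
```

Since the features are bounded sinusoids, nearby coordinates get similar embeddings while distant ones decorrelate, which is why I suspected this kind of encoding could explain spatial awareness.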

I would be grateful if you could shed some light on this aspect of the model's functionality.

Thank you!
