Hi @vikhyat,

Thank you for the great work! It is inspiring. I tested almost all the available VLMs for my project, which needs a model that understands the spatial relationships of objects in a scene. My test sample is an image of a person on one side of a room pointing at an object on the other side, with several objects in between. A human can understand it easily, but no VLM could get it except moondream2! It is magical.
My question is: how does moondream2 understand spatial relationships so well? At first, I thought it was because of the region_model and the FourierFeatures module, which add numerical representations of region positions to the model, but you mentioned in a response to another issue that these modules are not integrated into the current version. My next guess was the Vision Transformer architecture, the position embeddings, and the SigLIP initialization. Could these factors be contributing to moondream2's impressive spatial reasoning capabilities?
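For context on my first guess: my understanding of a FourierFeatures-style encoding is that a normalized 2-D coordinate is projected through a fixed random frequency matrix and passed through sin/cos, giving the model a smooth numerical representation of position. A minimal sketch of that idea (the function name, shapes, and frequency scale here are my assumptions, not moondream2's actual implementation):

```python
import numpy as np

def fourier_features(coords, B):
    """Map normalized (x, y) coordinates to sin/cos Fourier features.

    coords: (N, 2) array of positions in [0, 1]
    B:      (2, num_freqs) fixed random projection (frequency) matrix
    returns: (N, 2 * num_freqs) feature array
    """
    proj = 2.0 * np.pi * coords @ B                      # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(2, 16))                 # sampled once at init, then frozen
centers = np.array([[0.25, 0.5], [0.75, 0.5]])           # two hypothetical region centers
feats = fourier_features(centers, B)                     # (2, 32) position embedding
```

Since the projection is smooth in the input coordinates, nearby regions get similar embeddings, which is why I assumed this module was the source of the spatial awareness before reading your comment.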
I would be grateful if you could shed some light on this aspect of the model's functionality.
Thank you!