Hi @vikhyat,

Thank you for the great work! It is inspiring. I tested almost all the available VLMs for my project, which needs a model that understands the spatial relationships of objects in a scene. My test sample is an image of a person on one side of a room pointing at an object on the other side, with several objects in between. A human can understand it easily, but no VLM could get it except moondream2! It is magical.
My question is: how does moondream2 understand spatial relationships so well? At first, I thought it was because of the region_model and the FourierFeatures module, which add numerical representations of region positions to the model, but you mentioned in a response to another issue that these modules are not integrated into the current version. My next guess was the Vision Transformer architecture, the position embeddings, and the SigLIP initialization. Could these factors be contributing to moondream2's impressive spatial reasoning capabilities?
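For context on my first guess: my understanding of a FourierFeatures-style encoding is that a normalized 2-D coordinate is projected through a fixed random frequency matrix and passed through sin/cos, giving the model a smooth numerical representation of position. A minimal sketch of that idea (the function name, shapes, and frequency scale here are my assumptions, not moondream2's actual implementation):

```python
import numpy as np

def fourier_features(coords, B):
    """Map normalized (x, y) coordinates to sin/cos Fourier features.

    coords: (N, 2) array of positions in [0, 1]
    B:      (2, num_freqs) fixed random projection (frequency) matrix
    returns: (N, 2 * num_freqs) feature array
    """
    proj = 2.0 * np.pi * coords @ B                      # (N, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=(2, 16))                 # sampled once at init, then frozen
centers = np.array([[0.25, 0.5], [0.75, 0.5]])           # two hypothetical region centers
feats = fourier_features(centers, B)                     # (2, 32) position embedding
```

Since the projection is smooth in the input coordinates, nearby regions get similar embeddings, which is why I assumed this module was the source of the spatial awareness before reading your comment.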
I would be grateful if you could shed some light on this aspect of the model's functionality.
Thank you!