Another question: the README says the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). Why is the training more efficient?
It seems the image modularization strategy would cost more time or memory in the image encoding stage (one image is divided into several slices).
So is the efficiency due to fewer visual tokens (perceiver resampler rather than MLP projection)? Looking forward to your reply :)
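For intuition, here is a rough back-of-the-envelope sketch in Python. The 576-token figure is LLaVA-1.5's standard MLP-projected length (24×24 CLIP patches); the 64 tokens per slice and 4 slices are purely illustrative assumptions, not numbers from this repo:

```python
# Back-of-the-envelope comparison of visual token counts fed to the LLM.
# Only the 576 figure is LLaVA-1.5's known token count; the rest are assumptions.

TOKENS_PER_MLP_IMAGE = 576        # LLaVA-1.5: 24x24 CLIP patches kept by the MLP projector
TOKENS_PER_RESAMPLED_SLICE = 64   # assumed perceiver-resampler output length per slice
NUM_SLICES = 4                    # assumed number of slices per image

mlp_tokens = TOKENS_PER_MLP_IMAGE
resampler_tokens = NUM_SLICES * TOKENS_PER_RESAMPLED_SLICE

print(f"MLP projection tokens per image:      {mlp_tokens}")
print(f"Perceiver resampler tokens per image: {resampler_tokens}")
# Even with several slices, the resampled visual sequence can be much shorter.
# Since LLM training cost grows with (and attention roughly quadratically in)
# sequence length, the extra ViT encoding per slice can be outweighed by
# cheaper LLM forward/backward passes.
```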
It seems that $|\log r|$ should be $\left|\log \frac{W_I}{H_I} + \log \frac{n}{m}\right|$.
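(Assuming $r$ denotes the ratio between the image aspect ratio $W_I / H_I$ and the slice-grid aspect ratio $m / n$, the rewrite would follow from

$$\left|\log r\right| = \left|\log \frac{W_I / H_I}{m / n}\right| = \left|\log \frac{W_I}{H_I} + \log \frac{n}{m}\right|.$$

This interpretation of $r$ is my assumption about the notation, not something stated here.)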