Hello, and thank you for your work on this repository.
I have a question regarding the implementation of the FILIP embedding model in this repository.
In the original FILIP paper, it is mentioned that padding vectors are excluded from similarity computation to prevent performance degradation.
"Unlike Khattab & Zaharia (2020), we discard the padded tokens and use average instead of summation of token-wise maximum similarities when computing the image-text alignment, which enhances the cross-modal representation learning and stabilizes training."
However, based on my understanding of the code here, it seems that padding vectors are also being used in the similarity calculation.
In the implementation, FILIP uses top-k selection in the `get_weighted_dense_logits` function of the FILIP model.
However, if the top-k value (an input argument of `get_weighted_dense_logits`) is larger than the number of non-padded token vectors in a text/image sample, then padding vectors can enter the similarity calculation.
Moreover, selecting the top-k vectors is not theoretically equivalent to discarding the padded-token vectors.
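To make the distinction concrete, here is a small NumPy sketch (my own toy example, not the repository's code) contrasting the paper's padding-aware alignment with an unmasked variant in which padded tokens compete in the per-patch maximum, which is effectively what can happen when the top-k value exceeds the number of real tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, n_tokens, n_real, dim = 4, 6, 3, 8  # tokens 3..5 are padding

# L2-normalized toy image-patch and text-token embeddings
img = rng.normal(size=(n_patches, dim))
txt = rng.normal(size=(n_tokens, dim))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

sim = img @ txt.T                    # (n_patches, n_tokens) token-wise similarities
mask = np.arange(n_tokens) < n_real  # True for real (non-padded) tokens

# (a) FILIP paper: discard padded tokens, then average the per-patch maxima
align_masked = np.where(mask[None, :], sim, -np.inf).max(axis=1).mean()

# (b) No masking: padded tokens can win the per-patch maximum.
#     This approximates the behavior when top-k keeps k >= n_real tokens.
align_unmasked = sim.max(axis=1).mean()

print(align_masked, align_unmasked)
```

Since maximizing over a superset of tokens can only raise each per-patch maximum, the unmasked alignment is biased upward whenever a padding vector happens to score highest, which is presumably why the paper reports that discarding padded tokens stabilizes training.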
I would like to confirm whether my understanding is correct. If padding vectors are indeed included in the similarity computation, could you clarify the reason behind this design choice?
Thank you for your time and support!