When calculating the cosine similarity between the background and the text, are only the features of the background extracted? If so, how do you remove the features of the foreground objects? I tried blacking out the foreground object in the image and keeping the background, but CLIP sometimes still recognizes the object and gives a high score. So I do not understand how you extracted image features from only the background of the image.
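
To make my attempt concrete, here is a minimal sketch of what I mean by blacking out the foreground and scoring the result. It is not the authors' method. The names `black_out_foreground`, `cosine_similarity`, and `fg_mask` are hypothetical, and I assume the foreground mask is already available as a boolean array; the embeddings passed to `cosine_similarity` would come from CLIP's image and text encoders, which are left out here.

```python
import numpy as np


def black_out_foreground(image, fg_mask):
    """Return a copy of `image` (H x W x C) with foreground pixels set to 0.

    `fg_mask` is an assumed boolean H x W array, True where the object is.
    """
    image = np.asarray(image)
    fg_mask = np.asarray(fg_mask, dtype=bool)
    if fg_mask.shape != image.shape[:2]:
        raise ValueError("mask must match the image height and width")
    out = image.copy()
    out[fg_mask] = 0  # zero every channel of the masked pixels
    return out


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(a @ b / denom)
```

The masked image is what I then feed to the image encoder. The blacked-out region still keeps the outline of the object, which may be related to the high scores I see, but I am not sure.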

