Hey, I believe there's a bug in the evaluation. The NegCLIP code is based on open_clip, which supports both the original ViT-B/32 architecture, which uses QuickGELU (they name it `ViT-B-32-quickgelu`), and their "standard" one (`ViT-B-32`).

When you use their code, you need to specify the `model` (config) and the `pretrained` checkpoint, where `pretrained` is either a name supported for the given model or a path. They support the "openai" checkpoint for both `ViT-B-32` and `ViT-B-32-quickgelu`, because they hardcode this pretrained checkpoint name to switch to the QuickGELU implementation, regardless of which of the two configs was used.

The problem is that `ViT-B-32` also seems to have been used for evaluation (by passing a path to `pretrained` instead of "openai"). In that case GELU is used instead of QuickGELU, because the hardcoded `if` branch is never triggered, and this affects the results. In my humble opinion, this is error-prone behavior on open_clip's part. The fix would be to use `ViT-B-32-quickgelu` in the evaluation, or to pass the `--force-quick-gelu` flag.
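To make the fix concrete, here is a rough sketch of the two options using open_clip's factory API (the checkpoint path is a placeholder, and argument names may differ slightly in the NegCLIP fork of open_clip):

```python
import open_clip

# How the evaluation appears to have loaded the checkpoint: with the plain
# ViT-B-32 config and a local path for `pretrained`, the hardcoded
# "pretrained == 'openai'" branch is never hit, so the model is built
# with GELU instead of QuickGELU.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth"  # placeholder path
)

# Fix option 1: use the config that bakes in QuickGELU.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="/path/to/negclip.pth"
)

# Fix option 2: keep ViT-B-32 but force QuickGELU explicitly
# (the Python-API counterpart of the --force-quick-gelu CLI flag).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="/path/to/negclip.pth", force_quick_gelu=True
)
```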
How do I know that you ran it this way for evaluation (i.e., that you ran into this bug)? Because when I use GELU I can reproduce your numbers from Table 6, but when I use QuickGELU I get different numbers. I reproduced the numbers using a fork of open_clip and ran my own ARO evaluation with the checkpoint you shared.
Numbers for NegCLIP:
|  | VG-Attribution | VG-Relation | Flickr30k-Order | COCO-Order |
| --- | --- | --- | --- | --- |
| Reported results from Table 6 | 71.0 | 81.0 | 91.0 | 86.0 |
| My evaluation using `ViT-B-32` | 70.5 | 80.1 | 90.1 | 87.0 |
| My evaluation using `ViT-B-32-quickgelu` | 72.0 | 81.8 | 86.4 | 82.7 |
On top of this, the zero-shot results on other benchmarks improve significantly after fixing this bug (comparing `ViT-B-32` against `ViT-B-32-quickgelu`). BTW, these results also match those reproduced by someone who commented on OpenReview and by #29. As we can see, the numbers are much closer to the original pre-trained CLIP numbers once the bug is fixed. For the sake of completeness (and to check that my evaluation is consistent), I also reproduced the ARO numbers for the OpenAI-pretrained CLIP ViT-B/32.
From the author's NegCLIP fork, I found that the NegCLIP model was fine-tuned from the openai:ViT-B-32 model, where QuickGELU is applied by default. So, in my opinion, to maintain consistency between training and evaluation, the NegCLIP model in this repo needs to be initialized with quick_gelu=True during evaluation, which is currently not set to true because of open_clip's behavior.
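For reference, the mismatch matters because the two activations are not the same function: QuickGELU is the sigmoid approximation used in OpenAI's CLIP, so a checkpoint fine-tuned from openai:ViT-B-32 expects it at evaluation time as well. A minimal PyTorch sketch of the difference:

```python
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    """Sigmoid-based GELU approximation used by OpenAI CLIP."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

x = torch.linspace(-3.0, 3.0, steps=7)
print(nn.GELU()(x))    # exact (erf-based) GELU, what quick_gelu=False gives
print(QuickGELU()(x))  # what the openai-initialized checkpoint was trained with
```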
I also put my own evaluation results for NegCLIP below:
| Model | quick_gelu | VG_rel | VG_attr | Flickr30k_Order | COCO_Order | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| NegCLIP-ViT-B-32 | False | 80.2 | 70.5 | 91.0 | 86.8 | 82.1 |
| NegCLIP-ViT-B-32 | True | 81.8 | 72.1 | 87.1 | 82.6 | 80.9 |
In addition, in line with your observation, setting quick_gelu=True is especially critical for zero-shot recognition performance, as shown in my own results below.