Dear SimNPO authors,
Thank you for your insightful work and for open-sourcing the code base. Would you mind sharing the hyperparameter settings for reproducing the NPO and SimNPO results on the TOFU dataset (Table 1)? I am using the default finetune and forget configs (.yaml), but I notice a significant discrepancy between the evaluation scores (in aggregate_stat.txt) and the reported results. With SimNPO + grad_diff, I match the model utility score (0.578), but the forget quality (KS test p-value) is quite low (4e-6) compared to the reported 0.99. With SimNPO alone, I get ~0.4 forget quality but ~0 model utility. I have not changed the code or the default configs.
Referring to the evaluation section of the TOFU paper, shouldn't lower be better for the probability p(a|q) and ROUGE on the forget set, and higher be better for the truth ratio (prob. of wrong answers / prob. of correct answers) on the forget set? The direction of the arrows in Table 1 for ROUGE and probability on the forget set seems reversed, and the truth ratio arrow on the retain set appears reversed as well.
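For context on the forget-quality number I am comparing: my understanding from the TOFU paper is that forget quality is the p-value of a two-sample Kolmogorov-Smirnov test between the truth-ratio distributions (on the forget set) of the unlearned model and a retain-only reference model. A minimal pure-Python sketch of just the KS statistic, with made-up truth-ratio values (not your repo's actual evaluation code, which I assume uses scipy):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    grid = sorted(set(a) | set(b))
    # ECDF value at x = fraction of sample points <= x
    ecdf = lambda s, x: bisect.bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in grid)

# Hypothetical per-example truth ratios on the forget set for an
# unlearned model vs. a retain-only reference model (illustrative only):
unlearned = [0.42, 0.55, 0.61, 0.70, 0.88]
retain = [0.40, 0.52, 0.63, 0.71, 0.90]
print(round(ks_statistic(unlearned, retain), 6))  # → 0.2
```

A small statistic (similar distributions) yields a high KS p-value, i.e. high forget quality; my 4e-6 suggests the two distributions are very different, which is why I suspect a config issue rather than noise.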
Looking forward to hearing from you, thanks.