
Reproducing Table 1 on TOFU, and metrics clarification #7

@HaoranTang

Description


Dear SimNPO authors,

Thank you for your insightful work and for open-sourcing the code base. Would you mind sharing the hyper-parameter settings used to reproduce the NPO and SimNPO results on the TOFU dataset (Table 1)? I am using the default finetune and forget configs (.yaml), but I notice a significant discrepancy between the evaluation scores (in aggregate_stat.txt) and the reported results. With SimNPO + grad_diff, I get a matching model utility score (0.578), but the forget quality (KS test p-value) is quite low (4e-6) compared to the reported 0.99. With SimNPO alone, I get ~0.4 forget quality but ~0 model utility. I haven't changed the code or the default configs.
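
For context, my understanding is that forget quality is the p-value of a two-sample KS test comparing the forget-set truth-ratio distributions of the unlearned model and the retain model. A minimal sketch of that computation (the function and variable names below are my own, not taken from your code base):

```python
# Sketch of forget quality as I understand it: the p-value of a two-sample
# KS test between the forget-set truth ratios of the unlearned model and
# those of the retain (gold) model. Names are mine, not from the repo.
from scipy.stats import ks_2samp

def forget_quality(truth_ratios_unlearned, truth_ratios_retain):
    result = ks_2samp(truth_ratios_unlearned, truth_ratios_retain)
    return result.pvalue
```

If the value in aggregate_stat.txt is computed differently, please correct me, as that might explain part of the discrepancy I am seeing.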

Referring to the evaluation section of the TOFU paper, shouldn't lower be better for the probability p(a|q) and ROUGE on the forget set, and higher be better for the truth ratio (prob. of wrong answers / prob. of correct answer) on the forget set? The direction of the arrows in Table 1 for ROUGE and probability on the forget set seems reversed, and the truth-ratio arrow on the retain set seems reversed as well.
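
For concreteness, this is the truth ratio as I read it from the TOFU paper, using length-normalized answer probabilities (a sketch only; the variable names and exact normalization reflect my reading, so please correct me if the repo computes it differently):

```python
import numpy as np

def truth_ratio(logp_wrong, len_wrong, logp_correct, len_correct):
    # Mean length-normalized probability of the perturbed (wrong) answers
    # divided by the length-normalized probability of the correct answer,
    # i.e. mean_i P(a_wrong_i|q)^(1/|a_wrong_i|) / P(a_correct|q)^(1/|a_correct|).
    # Inputs are summed log-probabilities and token counts per answer.
    p_wrong = np.exp(np.asarray(logp_wrong) / np.asarray(len_wrong)).mean()
    p_correct = np.exp(logp_correct / len_correct)
    return p_wrong / p_correct
```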

Looking forward to hearing from you, thanks.
