Hi @xiamengzhou and all the contributors,
Thanks for the great paper and repo, which greatly facilitate future work. However, I am having some difficulty reproducing the EM score on the NQ dataset for the Llama 2 7B model in Table 2.
After checking your evaluation script https://github.com/princeton-nlp/LLM-Shearing/blob/main/icl_eval/run_eval.sh , I found no difference from my evaluation setup. However, I can only get around 24.32 EM with a Llama 2 chat model (32 shots). I will switch to a base model soon, but I feel it necessary to confirm whether there might be an error in the EM score you report.
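For reference, here is a minimal sketch of how EM (exact match) is commonly computed in open-domain QA evaluations (lowercase, strip punctuation and articles, collapse whitespace, then compare against all gold answers). The `normalize_answer` and `exact_match` helpers below are illustrative assumptions and may not match the exact implementation used in `icl_eval`:

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if the normalized prediction matches any normalized gold answer, else 0.0."""
    return float(any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers))


# Example: average EM (in percent) over a small set of (prediction, gold answers) pairs
examples = [
    ("Barack Obama", ["Barack Obama", "Obama"]),
    ("the Eiffel Tower", ["Eiffel Tower"]),
]
em = 100.0 * sum(exact_match(p, golds) for p, golds in examples) / len(examples)
print(f"EM: {em:.2f}")
```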
Best,