Hi @xiamengzhou and all the contributors,
Thanks for the great paper and repo, which greatly facilitate future work. However, I am having some difficulty reproducing the EM score on the NQ dataset for the Llama 2 7B model in Table 2.
After checking your evaluation script https://github.com/princeton-nlp/LLM-Shearing/blob/main/icl_eval/run_eval.sh , I found no difference from my evaluation setup. However, I can only get around 24.32 EM with a Llama 2 chat model (32 shots). I will switch to a base model soon, but I feel it necessary to confirm whether there might be an error in the EM score you report.
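For reference, here is a minimal sketch of how EM (exact match) is commonly computed in open-domain QA evaluations (lowercase, strip punctuation and articles, collapse whitespace, then compare against all gold answers). The `normalize_answer` and `exact_match` helpers below are illustrative assumptions and may not match the exact implementation used in `icl_eval`:

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if the normalized prediction matches any normalized gold answer, else 0.0."""
    return float(any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers))


# Example: average EM (in percent) over a small set of (prediction, gold answers) pairs
examples = [
    ("Barack Obama", ["Barack Obama", "Obama"]),
    ("the Eiffel Tower", ["Eiffel Tower"]),
]
em = 100.0 * sum(exact_match(p, golds) for p, golds in examples) / len(examples)
print(f"EM: {em:.2f}")
```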
Best,