Any plans on benchmarking generative LLMs? #3
Hi @ogencoglu, this is a great question. We are actively working on it. The biggest open question is still the seed set of prompts. The zero-shot approach sounds like a good starting point, but it would be valuable to test a few-shot setup too, so we also need to sample shots for the examples. Do you have any specific prompts in mind? We can iterate on this idea.
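For the shot-sampling point above, a minimal sketch of building a few-shot prompt from a labeled pool might look like this. The pool, labels, and template are illustrative assumptions, not the project's actual benchmark format:

```python
import random

def build_few_shot_prompt(pool, query, k=3, seed=42):
    """pool: list of (text, label) pairs; query: text to classify.

    Samples k shots with a fixed seed so the shot set is reproducible
    across benchmark runs (a hypothetical convention, not the project's).
    """
    rng = random.Random(seed)
    shots = rng.sample(pool, k)
    lines = [f"Text: {t}\nLabel: {l}" for t, l in shots]
    lines.append(f"Text: {query}\nLabel:")  # model completes the last label
    return "\n\n".join(lines)

# Toy sentiment pool for illustration only.
pool = [("great film", "positive"), ("boring plot", "negative"),
        ("loved it", "positive"), ("waste of time", "negative")]
print(build_few_shot_prompt(pool, "superb acting", k=2))
```

Fixing the sampling seed per example is one way to keep few-shot results comparable across models.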
I guess the simplest prompt would be:
That's what HELM does, if I remember correctly.
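The prompt text referenced above didn't survive in this copy of the thread, but a minimal zero-shot classification prompt of the kind HELM-style evaluations use might look like the following. The task, labels, and wording are purely illustrative:

```python
# Assumed label set for a toy sentiment task; not the project's schema.
LABELS = ["positive", "negative"]

def zero_shot_prompt(text, labels=LABELS):
    """Build a single-turn prompt asking for one label, no examples."""
    options = " or ".join(labels)
    return (f"Classify the sentiment of the text as {options}.\n"
            f"Text: {text}\nLabel:")

print(zero_shot_prompt("the soundtrack was wonderful"))
```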
It will not give you unified completions. The model could return any text, which makes it hard to compare against gold labels. You can get completions such as a single
However, in this case we have already chosen specific words/instructions, and every part of the instruction can influence the outcome. The prompt is therefore a particularly important part of testing an LLM. This is one of the reasons we haven't added LLMs to the benchmark yet: it is simply hard to prepare an objective evaluation for them.
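The comparison problem described above (free-form completions vs. gold labels) is often handled by normalizing the completion and matching it against the allowed label set. A minimal sketch, where the label set and matching rule are assumptions:

```python
# Assumed label set for illustration.
LABELS = ["positive", "negative", "neutral"]

def match_label(completion, labels=LABELS):
    """Map a free-form model completion to a gold label, or None.

    Lower-cases the completion and returns the label whose name appears
    earliest in the text; returns None if no label name appears at all.
    """
    text = completion.lower()
    hits = [(text.find(lab), lab) for lab in labels if lab in text]
    return min(hits)[1] if hits else None

print(match_label("I'd say this is clearly Positive."))  # -> positive
print(match_label("Hard to tell."))                      # -> None
```

Even this simple matcher illustrates the maintainer's point: the mapping rule itself becomes part of the evaluation and can change the scores.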
Thanks for the great work!
Are there any plans to include LLMs (open source & proprietary) in your benchmark? Something like the HELM benchmark would be great. For example, just inference results (with a basic prompt) would be valuable, without having to do fine-tuning or in-context learning.