
Any plans on benchmarking generative LLMs? #3

Open
ogencoglu opened this issue Jun 15, 2023 · 3 comments

Comments

@ogencoglu

Thanks for the great work!

Are there any plans to include LLMs (open source & proprietary) in your benchmark? Something like the HELM benchmark would be great. For example, even plain inference results (with a basic prompt) would be valuable, without having to do fine-tuning or in-context learning.

@laugustyniak
Collaborator

Hi @ogencoglu,

This is a great question. We are actively working on it. Still, the biggest open question is the seed set of prompts. The zero-shot approach sounds like a good starting point, but it would be valuable to test a few-shot setup too, so we would also have to sample shots for the examples; a rough sketch of what that sampling could look like is below.
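
Very roughly, shot sampling could amount to drawing k labeled examples per class from a training split and prepending them to the HELM-style template. This is only a sketch, assuming a generic list of (passage, label) pairs; none of these names come from our actual pipeline:

```python
import random

# Hypothetical labeled training examples: (passage, label) pairs.
train_examples = [
    ("The plot was dull and predictable.", "negative"),
    ("A solid, well-acted drama.", "positive"),
    ("It exists. That's about it.", "neutral"),
    # ...
]

def sample_shots(examples, k_per_class, seed=42):
    """Draw k examples per class to use as few-shot demonstrations."""
    rng = random.Random(seed)
    by_label = {}
    for passage, label in examples:
        by_label.setdefault(label, []).append((passage, label))
    shots = []
    for label, items in by_label.items():
        shots.extend(rng.sample(items, min(k_per_class, len(items))))
    rng.shuffle(shots)
    return shots

def build_few_shot_prompt(shots, passage):
    """Prepend the sampled demonstrations to a 'Passage/Sentiment' template."""
    demos = "\n\n".join(f"Passage: {p}\nSentiment: {l}" for p, l in shots)
    return f"{demos}\n\nPassage: {passage}\nSentiment:"
```

Even this simple version already raises evaluation questions: the seed, the number of shots per class, and whether shots are balanced across labels all influence the result.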

Do you have any specific prompts in mind? We can iterate on this idea.

@ogencoglu
Author

I guess the simplest prompt would be:

```
Passage: <...>
Sentiment:
```

That's what HELM does if I remember correctly.
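
In code, filling that template is just string formatting. A minimal zero-shot sketch (where `generate` is a placeholder for whatever model or API gets benchmarked, not a real function):

```python
def build_zero_shot_prompt(passage: str) -> str:
    """HELM-style minimal template: the passage plus a 'Sentiment:' cue."""
    return f"Passage: {passage}\nSentiment:"

# `generate` is a stand-in for any LLM call (open source or proprietary):
# completion = generate(build_zero_shot_prompt("The food was great, the service slow."))
```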

@laugustyniak
Collaborator

It will not give you unified completions. The model could return any text, and it would be hard to compare that to the gold labels. You can get completions such as a single `negative`, but you can also get something like "the provided text looks like a negative one". Hence, you must provide more descriptive instructions to the model, for example:

```
Act as a sentiment model and return only one of the classes `negative`, `neutral`, or `positive`.
Passage: <...>
Sentiment:
```

However, in this case we have already chosen some words/instructions, and every part of that instruction could influence the outcome. So the prompt itself becomes a particularly important part of testing an LLM. This is one of the reasons why we haven't added LLMs to the benchmark yet: it is simply hard to prepare an objective evaluation for them. The scoring side, at least, is mechanical once the completions are constrained; a sketch follows.
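
For illustration only, completions could be normalized onto the three classes and compared against the gold labels roughly like this. The `generate` callable is a placeholder for any model call, and none of this reflects our actual evaluation code:

```python
import re

LABELS = ("negative", "neutral", "positive")

INSTRUCTION = (
    "Act as a sentiment model and return only one of the classes "
    "`negative`, `neutral`, or `positive`.\n"
    "Passage: {passage}\n"
    "Sentiment:"
)

def normalize_completion(text: str) -> str | None:
    """Map a free-form completion onto one of the allowed labels, if possible."""
    text = text.strip().lower()
    for label in LABELS:
        # Accept completions like "negative", " Negative.", or "it looks negative".
        if re.search(rf"\b{label}\b", text):
            return label
    return None  # unparseable completion

def evaluate(dataset, generate):
    """dataset: iterable of (passage, gold_label); generate: callable prompt -> completion."""
    correct, total, unparsed = 0, 0, 0
    for passage, gold in dataset:
        completion = generate(INSTRUCTION.format(passage=passage))
        pred = normalize_completion(completion)
        if pred is None:
            unparsed += 1
        elif pred == gold:
            correct += 1
        total += 1
    return {"accuracy": correct / total if total else 0.0, "unparsed": unparsed}
```

Even here, the choices (how to treat unparseable completions, whether substring matching is fair, how to break ties when several labels are mentioned) are exactly the kind of subjective decisions that make an objective LLM benchmark hard.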
