
Any plans on benchmarking generative LLMs? #3

Open
ogencoglu opened this issue Jun 15, 2023 · 3 comments

Comments

@ogencoglu

Thanks for the great work!

Are there any plans to include LLMs (open source & proprietary) in your benchmark? Something like the HELM benchmark would be great. For example, even plain inference results (with a basic prompt) would be valuable, without having to do fine-tuning or in-context learning.

@laugustyniak
Collaborator

Hi @ogencoglu,

This is a great question. We are actively working on it. Still, the biggest open question is the seed set of prompts. The zero-shot approach sounds like a good starting point, but it would be valuable to test a few-shot setup too, so we would also have to sample shots for the examples; a rough sketch of what that sampling could look like is below.
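
Very roughly, shot sampling could amount to drawing k labeled examples per class from a training split and prepending them to the HELM-style template. This is only a sketch, assuming a generic list of (passage, label) pairs; none of these names come from our actual pipeline:

```python
import random

# Hypothetical labeled training examples: (passage, label) pairs.
train_examples = [
    ("The plot was dull and predictable.", "negative"),
    ("A solid, well-acted drama.", "positive"),
    ("It exists. That's about it.", "neutral"),
    # ...
]

def sample_shots(examples, k_per_class, seed=42):
    """Draw k examples per class to use as few-shot demonstrations."""
    rng = random.Random(seed)
    by_label = {}
    for passage, label in examples:
        by_label.setdefault(label, []).append((passage, label))
    shots = []
    for label, items in by_label.items():
        shots.extend(rng.sample(items, min(k_per_class, len(items))))
    rng.shuffle(shots)
    return shots

def build_few_shot_prompt(shots, passage):
    """Prepend the sampled demonstrations to a 'Passage/Sentiment' template."""
    demos = "\n\n".join(f"Passage: {p}\nSentiment: {l}" for p, l in shots)
    return f"{demos}\n\nPassage: {passage}\nSentiment:"
```

Even this simple version already raises evaluation questions: the seed, the number of shots per class, and whether shots are balanced across labels all influence the result.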

Do you have any specific prompts in mind? We can iterate on this idea.

@ogencoglu
Author

I guess the simplest prompt would be:

```
Passage: <...>
Sentiment:
```

That's what HELM does if I remember correctly.
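
In code, filling that template is just string formatting. A minimal zero-shot sketch (where `generate` is a placeholder for whatever model or API gets benchmarked, not a real function):

```python
def build_zero_shot_prompt(passage: str) -> str:
    """HELM-style minimal template: the passage plus a 'Sentiment:' cue."""
    return f"Passage: {passage}\nSentiment:"

# `generate` is a stand-in for any LLM call (open source or proprietary):
# completion = generate(build_zero_shot_prompt("The food was great, the service slow."))
```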

@laugustyniak
Collaborator

It will not give you unified completions. The model could return any text, and it would be hard to compare that to the gold labels. You can get completions such as a single `negative`, but you can also get something like "the provided text looks like a negative one". Hence, you must provide more descriptive instructions to the model, for example:

```
Act as a sentiment model and return only one of the classes `negative`, `neutral`, or `positive`.
Passage: <...>
Sentiment:
```

However, in this case we have already chosen some words/instructions, and every part of that instruction could influence the outcome. So the prompt itself becomes a particularly important part of testing an LLM. This is one of the reasons why we haven't added LLMs to the benchmark yet: it is simply hard to prepare an objective evaluation for them. The scoring side, at least, is mechanical once the completions are constrained; a sketch follows.
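
For illustration only, completions could be normalized onto the three classes and compared against the gold labels roughly like this. The `generate` callable is a placeholder for any model call, and none of this reflects our actual evaluation code:

```python
import re

LABELS = ("negative", "neutral", "positive")

INSTRUCTION = (
    "Act as a sentiment model and return only one of the classes "
    "`negative`, `neutral`, or `positive`.\n"
    "Passage: {passage}\n"
    "Sentiment:"
)

def normalize_completion(text: str) -> str | None:
    """Map a free-form completion onto one of the allowed labels, if possible."""
    text = text.strip().lower()
    for label in LABELS:
        # Accept completions like "negative", " Negative.", or "it looks negative".
        if re.search(rf"\b{label}\b", text):
            return label
    return None  # unparseable completion

def evaluate(dataset, generate):
    """dataset: iterable of (passage, gold_label); generate: callable prompt -> completion."""
    correct, total, unparsed = 0, 0, 0
    for passage, gold in dataset:
        completion = generate(INSTRUCTION.format(passage=passage))
        pred = normalize_completion(completion)
        if pred is None:
            unparsed += 1
        elif pred == gold:
            correct += 1
        total += 1
    return {"accuracy": correct / total if total else 0.0, "unparsed": unparsed}
```

Even here, the choices (how to treat unparseable completions, whether substring matching is fair, how to break ties when several labels are mentioned) are exactly the kind of subjective decisions that make an objective LLM benchmark hard.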
