Paper segment: Score aggregation #839

Open
KennethEnevoldsen opened this issue May 28, 2024 · 9 comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 28, 2024

The goal of this section is to find a meaningful approach to aggregate scores across tasks.

Related to #837
Already discussed in #752

I believe the task is as follows:

  1. get a meaningful sample to test on (see Paper segment: Task selection #837)
  2. find a reasonable set of approaches to test (and discuss pros and cons beforehand) - feel free to add more here:
  3. likely experiment with multiple approaches and decide on the system of choice
@vaibhavad
Contributor

The paper referenced above provides a nice starting point for an evaluation metric that can substitute averaging. The proposed method uses instance-level and task-level rankings rather than scores to compute final system-level scores and ranks.

I am not sure how practically feasible it will be to store every instance-level performance for each task. Using rankings on tasks also has its limitations, such as not being sensitive to the difficulty of the task, or small differences in performance being ignored (as discussed in #752).

I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.

Essentially, I think we should be using the distribution of model scores on a task to evaluate a new model. The distribution of scores is also how we ourselves judge the difficulty of the task. If the distribution is peaked at a low score, it means it is a difficult task.

So the evaluation I have in mind is something like this: first, we'll have a set of reference models. For any task, we'll have a distribution of their scores. We then estimate the parameters of that distribution (this might require some assumptions, e.g. that the underlying distribution is Gaussian). The score of a new candidate model is its percentile on that distribution. That way, a model that makes a breakthrough on a difficult task is appropriately rewarded (high percentile). This method still retains the benefits of comparing against systems instead of averaging scores, while also quantifying task difficulty.
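To make this concrete, here is a minimal sketch of what scoring a candidate as a percentile could look like, assuming the reference models' scores for a task are available as a list of floats and assuming a Gaussian fit; the function name and the toy numbers are purely illustrative:

```python
# A sketch only: the Gaussian assumption and the name `percentile_score` are
# illustrative choices, not a settled design.
import numpy as np
from scipy import stats

def percentile_score(candidate_score: float, reference_scores: list[float]) -> float:
    """Score a candidate on one task as its percentile under a Gaussian fitted
    to the reference models' scores on that task."""
    mu = np.mean(reference_scores)
    sigma = np.std(reference_scores, ddof=1)
    return 100 * stats.norm.cdf(candidate_score, loc=mu, scale=sigma)

# Toy example: a "difficult" task where reference models cluster around 0.30.
reference = [0.28, 0.29, 0.30, 0.31, 0.33, 0.35]
print(percentile_score(0.50, reference))  # a breakthrough score -> percentile close to 100
```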

As I said before, this is still a half-baked idea. Intuitively, I feel it still retains the nice theoretical properties of Borda count (proposed in the paper), but we may have to prove that formally and empirically.

One drawback of this evaluation is the selection of reference models. For benchmarking purposes we'll have to keep the set fixed; however, the benchmark will evolve over time, and a fixed set of reference models may stop being representative.

Curious to hear your thoughts on this.

@KennethEnevoldsen
Contributor Author

I am not sure how practically feasible it will be to store every instance-level performance for each task.

We do not have instance-level scores, but for some tasks we have repeated runs (typically 10, to calculate the std and CI). I don't think it is feasible, at least for the current iteration of the benchmark.

I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.

I too have an idea, and it might be worth having a joint discussion of these ideas.

If the distribution is peaked at a low score, it means it is a difficult task.

Or that there is a lot of noise (performance can't get any better) - I am unsure how to differentiate the two.

first we'll have a set of reference models

We have discussed something like this for the ScandEval NLU benchmark. However, choosing a reference is quite hard.

I believe our three options are:

  • take the top model for a specific task (might not be representative)
  • use all models, i.e. the whole distribution (could be unduly influenced by, e.g., a lot of poorly performing models)
  • use a representative set of reference models (this introduces the problem of representativeness)

Another approach that I will also add to the table is modeling it as a generalization factor (a latent factor similar to IQ). This also allows for some hypothesis testing, e.g. do we believe that there is one underlying "language understanding factor", or does a model have multiple, e.g. one per language group or per task type?

Also worth mentioning is that there is no reason why we should require only one metric. We should just have a default in the dashboard.

@vaibhavad
Contributor

vaibhavad commented Jun 12, 2024

After implementing Borda count as a ranking mechanism, here is the change in rank for the top 20 models in the current leaderboard. The script is here.

| Model | Overall score | Borda score | Original rank | Borda rank | Change in rank |
|---|---|---|---|---|---|
| nvidia/NV-Embed-v1 | 69.3186 | 873 | 1 | 3 | -2 |
| voyage-large-2-instruct | 68.2793 | 874 | 2 | 4 | -2 |
| Linq-AI-Research/Linq-Embed-Mistral | 68.1745 | 570 | 3 | 1 | 2 |
| Salesforce/SFR-Embedding-Mistral | 67.557 | 652 | 4 | 2 | 2 |
| gte-Qwen1.5-7B-instruct | 67.3437 | 1147 | 5 | 8 | -3 |
| Alibaba-NLP/gte-Qwen1.5-7B-instruct | 67.3436 | 1148 | 6 | 9 | -3 |
| voyage-lite-02-instruct | 67.127 | 1153 | 7 | 10 | -3 |
| GritLM/GritLM-7B | 66.7634 | 1042 | 8 | 6 | 2 |
| intfloat/e5-mistral-7b-instruct | 66.6334 | 908 | 9 | 5 | 4 |
| google-gecko.text-embedding-preview-0409 | 66.3136 | 1082 | 10 | 7 | 3 |
| GritLM/GritLM-8x7B | 65.6568 | 1365 | 11 | 12 | -1 |
| Alibaba-NLP/gte-large-en-v1.5 | 65.3905 | 1783 | 12 | 25 | -13 |
| LLM2Vec-Meta-Llama-3-supervised | 65.0057 | 1686 | 13 | 21 | -8 |
| LLM2Vec-Mistral-supervised | 64.8018 | 1679 | 14 | 20 | -6 |
| jspringer/echo-mistral-7b-instruct-lasttoken | 64.6837 | 1723 | 15 | 23 | -8 |
| mixedbread-ai/mxbai-embed-large-v1 | 64.683 | 1334 | 16 | 11 | 5 |
| WhereIsAI/UAE-Large-V1 | 64.6357 | 1399 | 17 | 13 | 4 |
| text-embedding-3-large | 64.5896 | 1877 | 18 | 28 | -10 |
| voyage-lite-01-instruct | 64.4916 | 1795 | 19 | 26 | -7 |
| Cohere/Cohere-embed-english-v3.0 | 64.4743 | 1635 | 20 | 17 | 3 |

There is some shuffling in the top 10, but as a set, the same 10 models remain in the top 10. The shift in ranks is much more prominent in models beyond the top 10.
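For reference, here is a minimal sketch of one way to compute a Borda-style score (sum of per-task ranks, so lower totals are better, matching the table above); the data structure and toy scores are illustrative and not the linked script:

```python
# A sketch only: `results` maps model -> task -> score; the toy numbers are made up.
from typing import Dict

def borda_scores(results: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Sum of per-task ranks (1 = best on a task), so lower totals are better."""
    models = list(results)
    tasks = sorted({t for task_scores in results.values() for t in task_scores})
    totals = {m: 0 for m in models}
    for task in tasks:
        # Rank models on this task: highest score gets rank 1.
        ranked = sorted(models, key=lambda m: results[m][task], reverse=True)
        for rank, model in enumerate(ranked, start=1):
            totals[model] += rank
    return totals

results = {
    "model-a": {"task-1": 0.71, "task-2": 0.40},
    "model-b": {"task-1": 0.69, "task-2": 0.62},
    "model-c": {"task-1": 0.52, "task-2": 0.61},
}
print(borda_scores(results))  # {'model-a': 4, 'model-b': 3, 'model-c': 5}
```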

@sivareddyg

@vaibhavad can you also add a column with the actual scores?

@vaibhavad
Contributor

@sivareddyg - I updated the comment above with actual scores

@KennethEnevoldsen
Copy link
Contributor Author

Should we add the Borda count as well? (I want to see how well it gives a notion of closeness.)

Another point is that I don't believe this metric considers task correlation, which in the context of voting is fine (that is what we want), but in the context of model development we don't want to bias our model ranking toward the medical domain just because we include both MedrxivClusteringS2S and MedrxivClusteringP2P.

Allow me a silly example, which I believe is adequate here: if we want to estimate a person's height (the model's ranking), measuring their right leg (task A) is a good first step. However, adding the second leg (task B) shouldn't add much information to our estimate of the height (rank). Adding the torso (task C), though, should add more. Thus, assuming equal weight in votes seems problematic in our case, as some of the votes supply the same information.

It would be another thing if we believed our distribution of tasks represented the real-world use cases (which I don't believe is the case).

Why does this become important? When we, e.g., in #837, filter out correlated tasks (implicitly or explicitly), we believe that we don't lose too much information, but that might change the rank meaningfully (we can test this).

A simple solution is, of course, filtering tasks before we do the Borda count. However, it does annoy me that the metric is sensitive to adding correlated tasks (which should really only increase the certainty of our estimate, not make it poorer).
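To illustrate that sensitivity with a toy example (made-up ranks, not real leaderboard data), duplicating a task that carries no new information can change the aggregate outcome:

```python
# A toy illustration with made-up ranks: duplicating a task that adds no new
# information (think of a pair like MedrxivClusteringS2S / MedrxivClusteringP2P
# ranking models identically) changes the Borda outcome.
def borda(per_task_ranks):
    totals = {}
    for ranks in per_task_ranks:
        for model, rank in ranks.items():
            totals[model] = totals.get(model, 0) + rank
    return totals

task_1 = {"A": 1, "B": 2, "C": 3}
task_2 = {"C": 1, "B": 2, "A": 3}

print(borda([task_1, task_2]))          # {'A': 4, 'B': 4, 'C': 4} -> three-way tie
print(borda([task_1, task_2, task_2]))  # {'A': 7, 'B': 6, 'C': 5} -> C now wins outright
```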

I might be missing something here, do let me know if that is the case.

@KennethEnevoldsen
Contributor Author

Here is a proposed alternative: modeling it as a latent generalization factor,

where for a given model $m$ and task $t$ we model the observed performance as:

$S_t \sim \text{Beta}(\alpha, \beta)$

where $\alpha$ and $\beta$ are parameterised as:

$\alpha = \sigma(g_m) \cdot \phi_t$

$\beta = (1 - \sigma(g_m)) \cdot \phi_t$

where $g_m$ is the g factor of model $m$ and $\phi_t$ is a task-specific precision, so that the expected score is $\sigma(g_m)$ (note that this is quite similar to beta regression and to IQ models for humans). Note that this model can be expanded to, e.g., give a model separate g factors for specific domains or task types.
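A minimal sketch of how such a model could be fitted by maximum likelihood, assuming scores live in a models × tasks matrix with values in (0, 1); the synthetic data, initialisation, and choice of optimiser are illustrative, not the actual analysis code:

```python
# A sketch only: synthetic scores stand in for the real MTEB results, and the
# L-BFGS-B optimiser / log-phi parameterisation are illustrative choices.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
scores = rng.uniform(0.3, 0.9, size=(5, 10))  # models x tasks, values in (0, 1)
n_models, n_tasks = scores.shape

def neg_log_likelihood(params):
    g = params[:n_models]             # latent g factor per model
    phi = np.exp(params[n_models:])   # task precision, kept positive
    mu = expit(g)[:, None]            # expected score of model m on any task
    a = mu * phi[None, :]             # alpha = sigma(g_m) * phi_t
    b = (1.0 - mu) * phi[None, :]     # beta  = (1 - sigma(g_m)) * phi_t
    return -beta_dist.logpdf(scores, a, b).sum()

x0 = np.zeros(n_models + n_tasks)     # start at g = 0, phi = 1
res = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
g_hat = res.x[:n_models]
print("estimated g factors:", np.round(g_hat, 2))
print("implied mean scores:", np.round(expit(g_hat), 2))
```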

Comparing the correlation we get:

[screenshot]

This further gives us the option to compare models as distributions (estimates of uncertainty):

[screenshot]

I tried it using the task reduction as well:

[screenshots: 6 best, 14 best, 14 random]

Surprisingly, 14 random tasks give a higher (Pearson) correlation with the original score. This is probably because several of the tasks that are easiest to predict are also the ones that correlate well with other tasks.

@vaibhavad
Contributor

vaibhavad commented Jun 21, 2024

Correlation of Borda count with the mean score:

[scatterplot]
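For reference, a minimal sketch of how such a comparison could be computed, using the first five rows of the table above as placeholder input; note that a lower Borda score means a better rank, so the correlations are expected to come out negative:

```python
# A sketch only: the input lists are placeholders taken from the table above.
from scipy import stats

mean_scores = [69.3186, 68.2793, 68.1745, 67.557, 67.3437]
borda_counts = [873, 874, 570, 652, 1147]

# Pearson compares the raw values; Spearman compares the induced rankings.
print(stats.pearsonr(mean_scores, borda_counts))
print(stats.spearmanr(mean_scores, borda_counts))
```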

@KennethEnevoldsen
Contributor Author

That does look fairly reasonable as well.

I actually see that we are approaching this from two directions:

  1. Social choice theory / election theory etc.: predominantly concerned with ranking or selecting candidates. Many of the choices here sacrifice some important properties (e.g. see Arrow's impossibility theorem). However, in our case some of these concerns do not apply to our "voting system": e.g., we do not believe that any of our "voters" have agency, so they can't attempt to "cheat" the system. Thus we might go through the existing choices and select the most appropriate ones.
  2. Psychology / intelligence literature: where a latent factor is intended to be measured, which determines model generalization/quality etc.

(1) is generally geared toward selecting the model most preferred across all tasks, while (2) seeks to estimate the model's generalisation capability. Luckily, at the moment the two approaches seem to agree in general; however, they rely on quite different assumptions. (2), for example, can determine whether a task is relevant for gaining more information about the latent factor, while in (1) tasks (voters) are seen as equals (following democratic ideals).

I think this is a very reasonable thing to bring up during the writing of the section.
