Paper segment: Score aggregation #839

Open
KennethEnevoldsen opened this issue May 28, 2024 · 9 comments

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 28, 2024

The goal of this section is to find a meaningful approach to aggregate scores across tasks.

Related to #837
Already discussed in #752

I believe the task is as follows:

  1. get a meaningful sample to test on (see Paper segment: Task selection #837)
  2. find a reasonable set of approaches to test (and discuss pros and cons beforehand) - feel free to add more here:
  3. likely experiment with multiple approaches and decide on the system of choice
@vaibhavad
Contributor

The paper referenced above provides a nice starting point for an evaluation metric that can substitute averaging. The proposed method uses instance-level and task-level rankings rather than scores to compute final system-level scores and ranks.

I am not sure how practically feasible it will be to store every instance-level performance for each task. Using rankings on tasks also has its limitations, such as not being sensitive to the difficulty of the task, or small differences in performance being ignored (as discussed in #752).

I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.

Essentially, I think we should be using the distribution of model scores on a task to evaluate a new model. The distribution of scores is also how we ourselves judge the difficulty of the task. If the distribution is peaked at a low score, it means it is a difficult task.

So the evaluation I have in mind is something like this: first, we'll have a set of reference models. For any task, we'll have a distribution of their scores. We then estimate the parameters of that distribution (this might require some assumptions, e.g. that the underlying distribution is Gaussian). The score of a new candidate model is its percentile on that distribution. That way, a model that makes a breakthrough on a difficult task is appropriately rewarded (high percentile). This method still retains the benefits of comparing against systems instead of averaging scores, while also quantifying task difficulty.
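To make this concrete, here is a minimal sketch of what scoring a candidate as a percentile could look like, assuming the reference models' scores for a task are available as a list of floats and assuming a Gaussian fit; the function name and the toy numbers are purely illustrative:

```python
# A sketch only: the Gaussian assumption and the name `percentile_score` are
# illustrative choices, not a settled design.
import numpy as np
from scipy import stats

def percentile_score(candidate_score: float, reference_scores: list[float]) -> float:
    """Score a candidate on one task as its percentile under a Gaussian fitted
    to the reference models' scores on that task."""
    mu = np.mean(reference_scores)
    sigma = np.std(reference_scores, ddof=1)
    return 100 * stats.norm.cdf(candidate_score, loc=mu, scale=sigma)

# Toy example: a "difficult" task where reference models cluster around 0.30.
reference = [0.28, 0.29, 0.30, 0.31, 0.33, 0.35]
print(percentile_score(0.50, reference))  # a breakthrough score -> percentile close to 100
```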

As I said before, this is still a half-baked idea. Intuitively, I feel it still retains the nice theoretical properties of Borda count (proposed in the paper), but we may have to prove that formally and empirically.

One drawback of this evaluation is the selection of reference models. For benchmarking purposes we'll have to keep the set fixed; however, the benchmark will evolve over time, and a fixed set of reference models may stop being representative.

Curious to hear your thoughts on this.

@KennethEnevoldsen
Contributor Author

I am not sure how practically feasible it will be to store every instance-level performance for each task.

We do not have instance-level scores, but for some tasks we have repeated runs (typically 10, to calculate the std and CI). I don't think it is feasible, at least for the current iteration of the benchmark.

I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.

I too have an idea, and it might be worth having a joint discussion of these ideas.

If the distribution is peaked at a low score, it means it is a difficult task.

Or that there is a lot of noise (performance can't get any better) - I am unsure how to differentiate the two.

first we'll have a set of reference models

We have discussed something like this for the ScandEval NLU benchmark. However, choosing a reference is quite hard.

I believe our three options are:

  • take the top model for a specific task (might not be representative)
  • use all models, i.e. the whole distribution (could be unduly influenced by, e.g., a lot of poorly performing models)
  • use a representative set of reference models (this introduces the problem of representativeness)

Another approach that I will also add to the table is modeling it as a generalization factor (a latent factor similar to IQ). This also allows for some hypothesis testing, e.g. do we believe that there is one underlying "language understanding factor", or does a model have multiple, e.g. one per language group or per task type?

Also worth mentioning is that there is no reason why we should require only one metric. We should just have a default in the dashboard.

@vaibhavad
Contributor

vaibhavad commented Jun 12, 2024

After implementing Borda count as a ranking mechanism, here is the change in rank for the top 20 models in the current leaderboard. The script is here.

| Model | Overall score | Borda score | Original rank | Borda rank | Change in rank |
|---|---|---|---|---|---|
| nvidia/NV-Embed-v1 | 69.3186 | 873 | 1 | 3 | -2 |
| voyage-large-2-instruct | 68.2793 | 874 | 2 | 4 | -2 |
| Linq-AI-Research/Linq-Embed-Mistral | 68.1745 | 570 | 3 | 1 | 2 |
| Salesforce/SFR-Embedding-Mistral | 67.557 | 652 | 4 | 2 | 2 |
| gte-Qwen1.5-7B-instruct | 67.3437 | 1147 | 5 | 8 | -3 |
| Alibaba-NLP/gte-Qwen1.5-7B-instruct | 67.3436 | 1148 | 6 | 9 | -3 |
| voyage-lite-02-instruct | 67.127 | 1153 | 7 | 10 | -3 |
| GritLM/GritLM-7B | 66.7634 | 1042 | 8 | 6 | 2 |
| intfloat/e5-mistral-7b-instruct | 66.6334 | 908 | 9 | 5 | 4 |
| google-gecko.text-embedding-preview-0409 | 66.3136 | 1082 | 10 | 7 | 3 |
| GritLM/GritLM-8x7B | 65.6568 | 1365 | 11 | 12 | -1 |
| Alibaba-NLP/gte-large-en-v1.5 | 65.3905 | 1783 | 12 | 25 | -13 |
| LLM2Vec-Meta-Llama-3-supervised | 65.0057 | 1686 | 13 | 21 | -8 |
| LLM2Vec-Mistral-supervised | 64.8018 | 1679 | 14 | 20 | -6 |
| jspringer/echo-mistral-7b-instruct-lasttoken | 64.6837 | 1723 | 15 | 23 | -8 |
| mixedbread-ai/mxbai-embed-large-v1 | 64.683 | 1334 | 16 | 11 | 5 |
| WhereIsAI/UAE-Large-V1 | 64.6357 | 1399 | 17 | 13 | 4 |
| text-embedding-3-large | 64.5896 | 1877 | 18 | 28 | -10 |
| voyage-lite-01-instruct | 64.4916 | 1795 | 19 | 26 | -7 |
| Cohere/Cohere-embed-english-v3.0 | 64.4743 | 1635 | 20 | 17 | 3 |

There is some shuffling in the top 10, but as a set, the same 10 models remain in the top 10. The shift in ranks is much more prominent in models beyond the top 10.
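For reference, here is a minimal sketch of one way to compute a Borda-style score (sum of per-task ranks, so lower totals are better, matching the table above); the data structure and toy scores are illustrative and not the linked script:

```python
# A sketch only: `results` maps model -> task -> score; the toy numbers are made up.
from typing import Dict

def borda_scores(results: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Sum of per-task ranks (1 = best on a task), so lower totals are better."""
    models = list(results)
    tasks = sorted({t for task_scores in results.values() for t in task_scores})
    totals = {m: 0 for m in models}
    for task in tasks:
        # Rank models on this task: highest score gets rank 1.
        ranked = sorted(models, key=lambda m: results[m][task], reverse=True)
        for rank, model in enumerate(ranked, start=1):
            totals[model] += rank
    return totals

results = {
    "model-a": {"task-1": 0.71, "task-2": 0.40},
    "model-b": {"task-1": 0.69, "task-2": 0.62},
    "model-c": {"task-1": 0.52, "task-2": 0.61},
}
print(borda_scores(results))  # {'model-a': 4, 'model-b': 3, 'model-c': 5}
```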

@sivareddyg

@vaibhavad can you also add a column with the actual scores?

@vaibhavad
Contributor

@sivareddyg - I updated the comment above with actual scores

@KennethEnevoldsen
Copy link
Contributor Author

Should we add the Borda count as well? (I want to see how well it gives a notion of closeness.)

Another point is that I don't believe this metric considers task correlation, which in the context of voting is fine (that is what we want), but in the context of model development we don't want to bias our model ranking toward the medical domain just because we include both MedrxivClusteringS2S and MedrxivClusteringP2P.

Allow me a silly example, which I believe is adequate here: if we want to estimate a person's height (the model's ranking), measuring their right leg (task A) is a good first step. However, adding the second leg (task B) shouldn't add much information to our estimate of the height (rank). Adding the torso (task C), though, should add more. Thus, assuming equal weight in votes seems problematic in our case, as some of the votes supply the same information.

It would be another thing if we believed our distribution of tasks represented the real-world use cases (which I don't believe is the case).

Why does this become important? When we, e.g., in #837, filter out correlated tasks (implicitly or explicitly), we believe that we don't lose too much information, but that might change the rank meaningfully (we can test this).

A simple solution is, of course, filtering tasks before we do the Borda count. However, it does annoy me that the metric is sensitive to adding correlated tasks (which should really only increase the certainty of our estimate, not make it poorer).
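To illustrate that sensitivity with a toy example (made-up ranks, not real leaderboard data), duplicating a task that carries no new information can change the aggregate outcome:

```python
# A toy illustration with made-up ranks: duplicating a task that adds no new
# information (think of a pair like MedrxivClusteringS2S / MedrxivClusteringP2P
# ranking models identically) changes the Borda outcome.
def borda(per_task_ranks):
    totals = {}
    for ranks in per_task_ranks:
        for model, rank in ranks.items():
            totals[model] = totals.get(model, 0) + rank
    return totals

task_1 = {"A": 1, "B": 2, "C": 3}
task_2 = {"C": 1, "B": 2, "A": 3}

print(borda([task_1, task_2]))          # {'A': 4, 'B': 4, 'C': 4} -> three-way tie
print(borda([task_1, task_2, task_2]))  # {'A': 7, 'B': 6, 'C': 5} -> C now wins outright
```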

I might be missing something here, do let me know if that is the case.

@KennethEnevoldsen
Contributor Author

Here is a proposed alternative: modeling it as a latent generalization factor,

where for a given model $m$ and task $t$ we model the observed performance as:

$S_t \sim \text{Beta}(\alpha, \beta)$

where $\alpha$ and $\beta$ are parameterised as:

$\alpha = \sigma(g_m) \cdot \phi_t$

$\beta = (1 - \sigma(g_m)) \cdot \phi_t$

where $g_m$ is the g factor of model $m$ and $\phi_t$ is a task-specific precision, so that the expected score is $\sigma(g_m)$ (note that this is quite similar to beta regression and to IQ models for humans). Note that this model can be expanded to, e.g., give a model separate g factors for specific domains or task types.
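A minimal sketch of how such a model could be fitted by maximum likelihood, assuming scores live in a models × tasks matrix with values in (0, 1); the synthetic data, initialisation, and choice of optimiser are illustrative, not the actual analysis code:

```python
# A sketch only: synthetic scores stand in for the real MTEB results, and the
# L-BFGS-B optimiser / log-phi parameterisation are illustrative choices.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
scores = rng.uniform(0.3, 0.9, size=(5, 10))  # models x tasks, values in (0, 1)
n_models, n_tasks = scores.shape

def neg_log_likelihood(params):
    g = params[:n_models]             # latent g factor per model
    phi = np.exp(params[n_models:])   # task precision, kept positive
    mu = expit(g)[:, None]            # expected score of model m on any task
    a = mu * phi[None, :]             # alpha = sigma(g_m) * phi_t
    b = (1.0 - mu) * phi[None, :]     # beta  = (1 - sigma(g_m)) * phi_t
    return -beta_dist.logpdf(scores, a, b).sum()

x0 = np.zeros(n_models + n_tasks)     # start at g = 0, phi = 1
res = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
g_hat = res.x[:n_models]
print("estimated g factors:", np.round(g_hat, 2))
print("implied mean scores:", np.round(expit(g_hat), 2))
```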

Comparing the correlation we get:

[screenshot]

This further gives us the option to compare models as distributions (estimates of uncertainty):

[screenshot]

I tried it using the task reduction as well:

[screenshots: 6 best, 14 best, 14 random]

Surprisingly, 14 random tasks give a higher (Pearson) correlation with the original score. This is probably because several of the tasks that are easiest to predict are also the ones that correlate well with other tasks.

@vaibhavad
Contributor

vaibhavad commented Jun 21, 2024

Correlation of Borda count with the mean score:

[scatterplot]
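For reference, a minimal sketch of how such a comparison could be computed, using the first five rows of the table above as placeholder input; note that a lower Borda score means a better rank, so the correlations are expected to come out negative:

```python
# A sketch only: the input lists are placeholders taken from the table above.
from scipy import stats

mean_scores = [69.3186, 68.2793, 68.1745, 67.557, 67.3437]
borda_counts = [873, 874, 570, 652, 1147]

# Pearson compares the raw values; Spearman compares the induced rankings.
print(stats.pearsonr(mean_scores, borda_counts))
print(stats.spearmanr(mean_scores, borda_counts))
```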

@KennethEnevoldsen
Contributor Author

That does look fairly reasonable as well.

I actually see that we are approaching this from two directions:

  1. Social choice theory / election theory etc.: predominantly concerned with ranking or selecting candidates. Many of the choices here sacrifice some important properties (e.g. see Arrow's impossibility theorem). However, in our case some of these concerns do not apply to our "voting system": e.g., we do not believe that any of our "voters" have agency, so they can't attempt to "cheat" the system. Thus we might go through the existing choices and select the most appropriate ones.
  2. Psychology / intelligence literature: where a latent factor is intended to be measured, which determines model generalization/quality etc.

(1) is generally geared toward selecting the model most preferred across all tasks, while (2) seeks to estimate the model's generalisation capability. Luckily, at the moment the two approaches seem to agree in general; however, they rely on quite different assumptions. (2), for example, can determine whether a task is relevant for gaining more information about the latent factor, while in (1) tasks (voters) are seen as equals (following democratic ideals).

I think this is a very reasonable thing to bring up during the writing of the section.
