Paper segment: Score aggregation #839
The paper referenced above provides a nice starting point for an evaluation metric that can substitute for averaging. The proposed method uses instance-level and task-level rankings rather than scores to compute final system-level scores and ranks. I am not sure how practically feasible it will be to store every instance-level performance for each task. Using ranking on tasks does have its limitations, such as not being sensitive to the difficulty of a task, or small differences in performance being ignored (as discussed in #752). I am trying to think of a solution that resolves these issues while still being better than taking the mean.

I have a half-baked idea, and it would be helpful to get some feedback on it. Essentially, I think we should be using the distribution of model scores on a task to evaluate a new model. The distribution of scores is also how we ourselves judge the difficulty of a task: if the distribution is peaked at a low score, it is a difficult task. So the evaluation I have in mind is something like this. First, we have a set of reference models, which gives us a distribution of scores for any task. We then estimate the parameters of that distribution (this might require some assumptions, e.g. that the underlying distribution is Gaussian). The score of a new candidate model is then its percentile on that distribution. That way, a model which makes a breakthrough on a difficult task will be appropriately rewarded (high percentile). This method retains the benefits of comparing against systems instead of averaging scores, while also quantifying task difficulty.

As I said, this is still a half-baked idea. Intuitively, I feel it retains the nice theoretical properties of Borda count (proposed in the paper), but we may have to prove that formally and empirically. One drawback of this evaluation is the selection of reference models. For benchmarking purposes we'll have to keep it fixed; however, the benchmark will evolve over time, and the set of reference models may stop being a good representative set. Curious to know your thoughts on this.
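The percentile idea above can be sketched in a few lines. This is a minimal, non-parametric version using the empirical distribution (rather than fitting a Gaussian); the reference scores below are made up for illustration:

```python
import numpy as np

def percentile_score(candidate_score, reference_scores):
    """Empirical percentile of a candidate model's task score
    among a fixed set of reference model scores (0.0 to 1.0)."""
    reference_scores = np.asarray(reference_scores, dtype=float)
    # Fraction of reference models the candidate matches or beats.
    return float(np.mean(reference_scores <= candidate_score))

# Hypothetical "hard" task: reference scores cluster at low values,
# so even a moderate absolute score lands at a high percentile.
refs = [0.22, 0.25, 0.27, 0.30, 0.31, 0.33, 0.35, 0.38]
print(percentile_score(0.55, refs))  # breakthrough -> 1.0
print(percentile_score(0.31, refs))  # mid-pack -> 0.625
```

A parametric variant would instead fit, say, a normal distribution to `refs` and report its CDF at the candidate's score; that trades robustness for smoother behavior when the reference set is small.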
We do not have instance-level ranks, but for some tasks we have repeated runs (typically 10, used to calculate std and CI). I don't think it is feasible, at least for the current iteration of the benchmark.
I too have an idea and it might be worth a joint discussion on these ideas.
Or that there is a lot of noise (performance can't get any better); I am unsure how to differentiate between the two.
We have discussed something like this for the ScandEval NLU benchmark. However, choosing a reference is quite hard. I believe our three options are:
Another approach that I will also add to the table is modeling it as a generalization factor (a latent factor, similar to IQ). This also allows for some hypothesis testing, e.g. do we believe that there is one underlying "language understanding factor", or do we believe that a model has multiple, e.g. for language groups or for specific tasks?
Also worth mentioning is that there is no reason why we should require only one metric. We should just have a default in the dashboard.
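As a rough illustration of the latent-factor idea (a sketch, not a proposal for the actual fit): a rank-1 SVD of the mean-centered model-by-task score matrix gives one "factor score" per model and one loading per task. All numbers below are synthetic:

```python
import numpy as np

def one_factor_model(scores):
    """Crude single-latent-factor fit for an (n_models, n_tasks)
    score matrix, using a rank-1 SVD as a stand-in for proper
    factor analysis. Returns one factor score per model and one
    loading per task; a near-zero loading suggests the task tells
    us little about the shared factor."""
    s = np.asarray(scores, dtype=float)
    centered = s - s.mean(axis=0)      # remove per-task difficulty
    u, sigma, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * sigma[0], vt[0]   # per-model factor, per-task loadings

# Synthetic check: scores generated from one underlying ability,
# so the fit should recover that ability up to sign and scale.
ability = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.outer(ability, [0.5, 0.8, 0.3])  # 4 models x 3 tasks
factor, loadings = one_factor_model(scores)
```

Testing the one-factor hypothesis against a multi-factor one would then amount to checking how much variance the first component explains versus later ones.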
After implementing Borda count as a ranking mechanism, here is the change in rank for the top 20 models on the current leaderboard. The script is here.
There is some shuffling in the top 10, but as a set, the same 10 models remain in the top 10. The shift in ranks is much more prominent for models beyond the top 10.
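For readers, the core of the aggregation can be sketched as follows. This is a minimal stand-in, not the actual leaderboard script, and the score matrix is made up:

```python
import numpy as np

def borda_ranks(scores):
    """Borda count over an (n_models, n_tasks) score matrix.

    On each task the best model gets n_models - 1 points and the
    worst gets 0; points are summed across tasks, and models are
    ranked by total points (rank 0 = best)."""
    s = np.asarray(scores, dtype=float)
    # argsort of argsort yields each model's 0-based ascending
    # rank per task, which doubles as its Borda points.
    points = s.argsort(axis=0).argsort(axis=0)
    totals = points.sum(axis=1)
    return (-totals).argsort().argsort()

# Hypothetical 3 models x 2 tasks.
S = [[0.8, 0.6],
     [0.9, 0.9],
     [0.5, 0.4]]
print(borda_ranks(S))  # model 1 wins both tasks -> rank 0
```

(Ties are resolved here by array order; a real implementation would want an explicit tie-breaking rule.)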
@vaibhavad can you also have a column with actual scores |
@sivareddyg - I updated the comment above with actual scores |
Should we add Borda count as well? (I want to see how well it gives a notion of closeness.)

Another point is that I don't believe this metric considers task correlation. In the context of voting that is fine (it is what we want), but in the context of model development, we don't want to bias our model ranking toward medical just because we include both MedrxivClusteringS2S and MedrxivClusteringP2P.

Allow me a silly example, which I believe is adequate here: if we want to estimate a person's height (the model ranking), measuring their right leg (task A) is a good first step. However, adding the second leg (task B) shouldn't add much information to our estimate of the height (rank). Measuring the torso (task C), though, should add more. Thus, assuming equal weight in votes seems problematic in our case, as some of the votes supply the same information. It would be another thing if we believed our distribution of tasks represented real-world use cases (which I don't believe is the case).

Why does this become important? When we, e.g. in #837, filter out correlated tasks (implicitly or explicitly), we believe that we don't lose too much information, but that might change the ranks meaningfully (we can test this). A simple solution is, of course, filtering tasks before we do the Borda count. However, it does annoy me that the metric is sensitive to adding correlated tasks (which should really only increase the certainty of our estimate, not make it poorer). I might be missing something here; do let me know if that is the case.
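The sensitivity to correlated tasks is easy to demonstrate on toy data. In the hypothetical example below (all numbers invented), model 1 wins the Borda vote on three tasks, but duplicating task A, as a stand-in for adding a highly correlated "second leg" like the S2S/P2P pair, flips the winner to model 0 without adding new information:

```python
import numpy as np

def borda_totals(scores):
    """Total Borda points per model for an (n_models, n_tasks)
    matrix: on each task the best model gets n_models - 1 points,
    the worst gets 0."""
    s = np.asarray(scores, dtype=float)
    return s.argsort(axis=0).argsort(axis=0).sum(axis=1)

# Hypothetical 3 models x 3 tasks (A, C, D).
base = np.array([[0.9, 0.4, 0.6],
                 [0.4, 0.9, 0.7],
                 [0.6, 0.6, 0.4]])
print(borda_totals(base))  # [3 4 2] -> model 1 wins

# Append a copy of task A (a perfectly correlated duplicate).
dup = np.column_stack([base, base[:, 0]])
print(borda_totals(dup))   # [5 4 3] -> model 0 now wins
```

A mean-based aggregate shifts under duplication too, so this is not unique to Borda count; the point is that neither treats a correlated task as merely reducing uncertainty.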
That does look fairly reasonable as well. I actually see that we are kind of going at this from two approaches:
(1) is generally geared toward selecting the most preferred model across all tasks, while (2) seeks to estimate the model's generalization capability. Luckily, at the moment the two approaches seem to agree in general; however, they rely on quite different assumptions. (2), for example, has the ability to determine whether a task is relevant for gaining more information about the latent factor, while in (1) tasks (voters) are seen as equals (following democratic ideals). I think this is a very reasonable thing to bring up during the writing of the section.
The goal of this section is to find a meaningful approach to aggregate scores across tasks.
Related to #837
Already discussed in #752
I believe the task is as follows: