BLEURT #2
It seems that they have changed the API, so for 2, the command might be …
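The exact command is cut off above, so here is only a minimal sketch of one plausible form, assuming the Hugging Face `evaluate` package and that the checkpoint is chosen by config name at load time (the names and strings below are assumptions, not taken from this repo):

```python
# Sketch only: assumes the Hugging Face `evaluate` package, with the BLEURT
# checkpoint selected by config name when the metric is loaded.
import evaluate

bleurt = evaluate.load("bleurt", "BLEURT-20")  # config name picks the checkpoint
result = bleurt.compute(
    predictions=["the cat sat on the mat"],      # toy strings, not project data
    references=["a cat was sitting on the mat"],
)
print(result["scores"])  # one float per (prediction, reference) pair
```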
For 1, it seems that BLEURT does not guarantee the score to be within …
Indeed, after checking the source code, it seems that it is now … OMG, I cannot believe that there is a discrepancy even within one source code file.
From Ruixuan's result:
And BLEURT-tiny's distribution (from bleurt#1): Is this basically saying that BLEURT thinks the (reference, system summary) pairs are terrible? Is this expected? If not, we probably want to try different BLEURT models.
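To make questions like this (and the range question in 1) concrete, a small sketch that summarizes a list of BLEURT scores; it assumes the scores have already been computed into a Python list (e.g. `result["scores"]` in the sketch above), and the numbers in the example call are made up:

```python
# Sketch only: `scores` is assumed to be a list of BLEURT scores for
# (reference, system summary) pairs, e.g. result["scores"] from the earlier sketch.
import statistics

def summarize(scores):
    """Return basic distribution statistics and the fraction of negative scores."""
    negative = sum(1 for s in scores if s < 0)
    return {
        "n": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "frac_negative": negative / len(scores),
    }

print(summarize([-1.2, -0.8, 0.1, 0.4]))  # toy numbers, not real results
```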
What are the system summaries used here? Are they from the dataset itself, or generated by some summarizers? This determines the quality of the system summaries and helps interpret what the "expected scores" should be.
The system summaries are from the RealSumm and Newsroom datasets.
Like I said at the top, the BLEURT authors mention that BLEURT-tiny is horribly inaccurate. They recommend BLEURT-20.
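For reference, running the recommended checkpoint directly with the authors' `bleurt` package (rather than through a metric wrapper) looks roughly like the sketch below; the checkpoint path is an assumption and must point at a locally downloaded, unzipped BLEURT-20 directory:

```python
# Sketch only: uses the google-research/bleurt package directly.
# "BLEURT-20" is an assumed local path to the downloaded checkpoint.
from bleurt import score

scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(
    references=["a cat was sitting on the mat"],  # toy strings
    candidates=["the cat sat on the mat"],
)
print(scores)  # list of floats, one per (reference, candidate) pair
```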
Yeah, I agree, we should probably try the BLEURT-20 models. If BLEURT-tiny is "horribly inaccurate", there is little point in comparing two inaccurate numbers.
Yes. After switching to BLEURT-20, I have gotten positive results in …
Two issues.
`bleurt.compute(predictions=predictions, references=references)`
to
`bleurt.compute(predictions=predictions, references=references, checkpoint="BLEURT-20")`.
However, since we use the same language model for both the traditional and the new approach, I am not sure whether the language model matters. @NKWBTB @lihebi What are your thoughts on 2?
@TURX Please let me know your thoughts on 1.
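To help answer whether the underlying language model matters, one could score the same pairs with both checkpoints and compare. A minimal sketch, assuming the Hugging Face `evaluate` wrapper and that "bleurt-tiny-128" and "BLEURT-20" are available config names (both names and the strings below are assumptions, not project data):

```python
# Sketch only: scores the same (prediction, reference) pairs with two
# assumed BLEURT config names and prints them side by side.
import evaluate

predictions = ["the cat sat on the mat"]          # toy strings
references = ["a cat was sitting on the mat"]

for config in ("bleurt-tiny-128", "BLEURT-20"):   # assumed config names
    metric = evaluate.load("bleurt", config)
    scores = metric.compute(predictions=predictions, references=references)["scores"]
    print(config, scores)
```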