
BLEURT #2

Open
forrestbao opened this issue Jun 5, 2022 · 9 comments


forrestbao commented Jun 5, 2022

Two issues.

  1. Why are the BLEURT medians here negative? BLEURT values should always be between 0 and 1.
  2. BLEURT can be based on different architectures. The default model in HuggingFace is BLEURT-Tiny, which is very inaccurate and not recommended by the BLEURT authors. Please use BLEURT-20, which the authors report to be the best. See more on the different versions of BLEURT here. To do that, simply change `bleurt.compute(predictions=predictions, references=references)` to `bleurt.compute(predictions=predictions, references=references, checkpoint="BLEURT-20")`. However, since we use the same language model for both the traditional and the new approach, I am not sure whether the choice of language model matters.

@NKWBTB @lihebi What are your thoughts for 2?
@TURX Please let me know your thoughts for 1.


TURX commented Jun 5, 2022

It seems that they have changed the API, so for 2, the command might be `model = evaluate.load('bleurt', config_name='BLEURT-20', module_type='metric')`


TURX commented Jun 5, 2022

For 1, it seems that BLEURT does not guarantee the score to be within the $[0.0, 1.0]$ range; see google-research/bleurt#1
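Since the score is unbounded, a quick sanity check of the score distribution might help confirm what we are seeing. A minimal stdlib-only sketch (the helper name and the sample scores below are made up for illustration, not from our pipeline):

```python
from statistics import median

def summarize_scores(scores):
    """Summarize a list of metric scores and count values outside [0, 1].

    `scores` is assumed to be the list of floats returned by the metric;
    BLEURT itself does not bound its output, so negative values are possible.
    """
    out_of_range = [s for s in scores if not 0.0 <= s <= 1.0]
    return {
        "min": min(scores),
        "max": max(scores),
        "median": median(scores),
        "n_outside_unit_interval": len(out_of_range),
    }

# Made-up scores resembling what BLEURT-Tiny can produce:
print(summarize_scores([-1.2, -0.9, 0.1, 0.4]))
```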

@forrestbao

> It seems that they have changed the API, so for 2, the command might be `model = evaluate.load('bleurt', config_name='BLEURT-20', module_type='metric')`

Indeed, after checking the source code, it seems that it is now `config_name`; `module_type` seems unnecessary.

OMG, I cannot believe there is a discrepancy even within a single source file: `_KWARGS_DESCRIPTION` says `checkpoint` while the code says `config_name`. It's so messy.
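One way to cope with the kwarg-name discrepancy is to try one name and fall back to the other. This is only a sketch: `load_metric_compat` and the stub loaders below are hypothetical, and the stub stands in for `evaluate.load` so that nothing is downloaded here.

```python
def load_metric_compat(load_fn, name, model_id):
    """Try the newer `config_name` kwarg first, then fall back to `checkpoint`.

    `load_fn` stands in for `evaluate.load`; this only demonstrates
    the fallback pattern for the documented-vs-actual kwarg mismatch.
    """
    try:
        return load_fn(name, config_name=model_id)
    except TypeError:
        # Older API (and _KWARGS_DESCRIPTION) used `checkpoint` instead.
        return load_fn(name, checkpoint=model_id)

# Hypothetical stub that only accepts the old kwarg name:
def old_style_load(name, checkpoint):
    return f"{name}:{checkpoint}"

print(load_metric_compat(old_style_load, "bleurt", "BLEURT-20"))  # → bleurt:BLEURT-20
```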


lihebi commented Jun 6, 2022

From Ruixuan's result:

|        | Newsroom (Trad) | Newsroom (New) | RealSumm (Trad) | RealSumm (New) |
|--------|-----------------|----------------|-----------------|----------------|
| BLEURT | -0.895417       | -1.544569      | -0.100036       | -1.042327      |

And BLEURT-Tiny's distribution (from bleurt#1):

[image: distribution of BLEURT-Tiny scores]

Is this basically saying that BLEURT thinks the (reference, system summary) pairs are terrible? Is this expected? If not, we probably want to try different BLEURT models.


lihebi commented Jun 6, 2022

What are the system summaries used here? Are they from the dataset itself, or generated by some summarizers? This determines the quality of the system summaries, which helps us interpret what the "expected scores" should be.


TURX commented Jun 6, 2022

> What are the system summaries used here? Are they from the dataset itself, or generated by some summarizers? This determines the quality of the system summaries, which helps us interpret what the "expected scores" should be.

The system summaries are from the RealSumm and Newsroom datasets.

@forrestbao

> And BLEURT-Tiny's distribution (from bleurt#1):
>
> [image: distribution of BLEURT-Tiny scores]
>
> Is this basically saying that BLEURT thinks the (reference, system summary) pairs are terrible? Is this expected? If not, we probably want to try different BLEURT models.

Like I said at the top, the BLEURT authors mention that BLEURT-Tiny is horribly inaccurate. They recommend BLEURT-20.


lihebi commented Jun 6, 2022

Yeah, I agree, we should probably try the BLEURT-20 model. When BLEURT-Tiny is "horribly inaccurate", it makes little sense to compare two inaccurate numbers.


TURX commented Jun 6, 2022

> Yeah, I agree, we should probably try the BLEURT-20 model. When BLEURT-Tiny is "horribly inaccurate", it makes little sense to compare two inaccurate numbers.

Yes. After switching to BLEURT-20, I got positive results in $[0.0, 1.0]$. I will update the results after calculating the correlations, which I am working on.
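For the correlation step, a minimal pure-Python Pearson sketch (the function name and the sample data are illustrative only, not from our pipeline; in practice `scipy.stats.pearsonr`/`spearmanr` would be the usual choice):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g. metric scores vs. human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores give r ≈ 1.0:
print(pearson_r([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))
```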
