
BLEURT #2

Open
forrestbao opened this issue Jun 5, 2022 · 9 comments


forrestbao commented Jun 5, 2022

Two issues.

  1. Why are the BLEURT medians here negative? BLEURT values should always be between 0 and 1.
  2. BLEURT can be based on different architectures. The default model in HuggingFace is BLEURT-Tiny, which is very inaccurate and not recommended by the BLEURT authors. Please use BLEURT-20, which the authors report to be the best. See more on the different versions of BLEURT here. To do that, simply change `bleurt.compute(predictions=predictions, references=references)` to `bleurt.compute(predictions=predictions, references=references, checkpoint="BLEURT-20")`. However, since we use the same language model for both the traditional and the new approach, I am not sure whether the choice of language model matters.

@NKWBTB @lihebi What are your thoughts for 2?
@TURX Please let me know your thoughts for 1.


TURX commented Jun 5, 2022

It seems that they have changed the API, so for 2, the command might be `model = evaluate.load('bleurt', config_name='BLEURT-20', module_type='metric')`


TURX commented Jun 5, 2022

For 1, it seems that BLEURT does not guarantee the score to be within the $[0.0, 1.0]$ range; see google-research/bleurt#1
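Since the score is unbounded, a quick sanity check of the score distribution might help confirm what we are seeing. A minimal stdlib-only sketch (the helper name and the sample scores below are made up for illustration, not from our pipeline):

```python
from statistics import median

def summarize_scores(scores):
    """Summarize a list of metric scores and count values outside [0, 1].

    `scores` is assumed to be the list of floats returned by the metric;
    BLEURT itself does not bound its output, so negative values are possible.
    """
    out_of_range = [s for s in scores if not 0.0 <= s <= 1.0]
    return {
        "min": min(scores),
        "max": max(scores),
        "median": median(scores),
        "n_outside_unit_interval": len(out_of_range),
    }

# Made-up scores resembling what BLEURT-Tiny can produce:
print(summarize_scores([-1.2, -0.9, 0.1, 0.4]))
```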

@forrestbao

> It seems that they have changed the API, so for 2, the command might be `model = evaluate.load('bleurt', config_name='BLEURT-20', module_type='metric')`

Indeed, after checking the source code, it seems that it is now `config_name`; `module_type` seems unnecessary.

OMG, I cannot believe there is a discrepancy even within a single source file: `_KWARGS_DESCRIPTION` says `checkpoint` while the code says `config_name`. It's so messy.
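One way to cope with the kwarg-name discrepancy is to try one name and fall back to the other. This is only a sketch: `load_metric_compat` and the stub loaders below are hypothetical, and the stub stands in for `evaluate.load` so that nothing is downloaded here.

```python
def load_metric_compat(load_fn, name, model_id):
    """Try the newer `config_name` kwarg first, then fall back to `checkpoint`.

    `load_fn` stands in for `evaluate.load`; this only demonstrates
    the fallback pattern for the documented-vs-actual kwarg mismatch.
    """
    try:
        return load_fn(name, config_name=model_id)
    except TypeError:
        # Older API (and _KWARGS_DESCRIPTION) used `checkpoint` instead.
        return load_fn(name, checkpoint=model_id)

# Hypothetical stub that only accepts the old kwarg name:
def old_style_load(name, checkpoint):
    return f"{name}:{checkpoint}"

print(load_metric_compat(old_style_load, "bleurt", "BLEURT-20"))  # → bleurt:BLEURT-20
```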


lihebi commented Jun 6, 2022

From Ruixuan's result:

|        | Newsroom (Trad) | Newsroom (New) | RealSumm (Trad) | RealSumm (New) |
|--------|-----------------|----------------|-----------------|----------------|
| BLEURT | -0.895417       | -1.544569      | -0.100036       | -1.042327      |

And BLEURT-Tiny's distribution (from bleurt#1):

[image: distribution of BLEURT-Tiny scores]

Is this basically saying that BLEURT thinks the (reference, system summary) pairs are terrible? Is this expected? If not, we probably want to try different BLEURT models.


lihebi commented Jun 6, 2022

What are the system summaries used here? Are they from the dataset itself, or generated by some summarizers? This determines the quality of the system summaries, which helps us interpret what the "expected scores" should be.


TURX commented Jun 6, 2022

> What are the system summaries used here? Are they from the dataset itself, or generated by some summarizers? This determines the quality of the system summaries, which helps us interpret what the "expected scores" should be.

The system summaries are from the RealSumm and Newsroom datasets.

@forrestbao

> And BLEURT-Tiny's distribution (from bleurt#1):
>
> [image: distribution of BLEURT-Tiny scores]
>
> Is this basically saying that BLEURT thinks the (reference, system summary) pairs are terrible? Is this expected? If not, we probably want to try different BLEURT models.

Like I said at the top, the BLEURT authors mention that BLEURT-Tiny is horribly inaccurate. They recommend BLEURT-20.


lihebi commented Jun 6, 2022

Yeah, I agree, we should probably try the BLEURT-20 model. When BLEURT-Tiny is "horribly inaccurate", it makes little sense to compare two inaccurate numbers.


TURX commented Jun 6, 2022

> Yeah, I agree, we should probably try the BLEURT-20 model. When BLEURT-Tiny is "horribly inaccurate", it makes little sense to compare two inaccurate numbers.

Yes. After switching to BLEURT-20, I got positive results in $[0.0, 1.0]$. I will update the results after calculating the correlations, which I am working on.
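For the correlation step, a minimal pure-Python Pearson sketch (the function name and the sample data are illustrative only, not from our pipeline; in practice `scipy.stats.pearsonr`/`spearmanr` would be the usual choice):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists,
    e.g. metric scores vs. human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores give r ≈ 1.0:
print(pearson_r([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]))
```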
