ROUGE: fixed sentences calculation and some minor refactoring #272
Conversation
Added initial changes:

```perl
sub wlcsWeight {
    my $r     = shift;
    my $power = shift;
    return $r**$power;
}

sub wlcsWeightInverse {
    my $r     = shift;
    my $power = shift;
    return $r**( 1 / $power );
}
```

But Google's implementation doesn't use weights - https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L210

The method […]. One failed test I see with Julia 1.3 because I used […]. Any feedback?
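For reference, a direct Python rendering of the Perl pair above (a sketch only; the default `power` here is illustrative, not a value configured anywhere in ROUGE-1.5.5):

```python
def wlcs_weight(r, power=2.0):
    """f(r) = r**power, mirroring wlcsWeight in ROUGE-1.5.5.pl."""
    return r ** power

def wlcs_weight_inverse(r, power=2.0):
    """f^-1(r) = r**(1 / power), mirroring wlcsWeightInverse."""
    return r ** (1.0 / power)

# In the paper (Sec. 3.2) these normalize the weighted LCS score, e.g.
# WLCS recall is f^-1(WLCS(X, Y) / f(m)) for a reference of length m.
```

With the identity weight (`power=1`) both functions are no-ops, which matches Google's unweighted behavior.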
We don't need 1.3. Can do 1.6 minimum.
And, the last open question - do we need to have […]?
Added a minor change: the weighting function can now be changed via an argument if necessary. At least both cases work - Google's without weighting and the previous implementation with a fixed […].
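To illustrate the "weighting function as an argument" idea, here is a hedged sketch in Python (the names and the default are mine, not the PR's actual API): with the identity weight it reduces to the plain LCS length that Google's implementation uses, while `f(k) = k**2` reproduces the weighted behavior that favors consecutive matches.

```python
def weighted_lcs(xs, ys, f=lambda k: k * k):
    """Weighted LCS score with a pluggable weight function f.

    f must be strictly increasing with f(0) == 0.  f(k) = k yields the
    plain (unweighted) LCS length; f(k) = k**2 rewards consecutive matches.
    """
    m, n = len(xs), len(ys)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # accumulated weighted score
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of the current match run
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if xs[i - 1] == ys[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])  # run broken; w stays 0
    return c[m][n]
```

For example, `weighted_lcs(list("ABCD"), list("ABXD"), f=lambda k: k)` gives 3.0 (the plain LCS length), while the default `f=lambda k: k * k` gives 5.0, because the consecutive "AB" run earns f(2) = 4 instead of 2.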
I'm OK with that plan.
…ext#272) ROUGE:

* fixed sentences calculation and some minor refactoring
* changed output types, optimized, aligned with ROUGE-1.5.5 and checked with Google Research implementation
* docs updates
* rouge_l_sentence - added a weighting-function argument.
Hi, I started looking at the ROUGE-N/L evaluation components and found that the code was sometimes inefficient and sometimes looked buggy. This is a first commit with some improvements and fixes.
First of all, I opened the original article https://aclanthology.org/W04-1013.pdf and found an incorrect mention of the "jackknifing procedure". In the code https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/utils.jl#L7 there is an implementation named `jackknife_avg`, but it is not the average. The implemented version looks like `argmax`, as literally mentioned in the paper, but has a wrong description. See https://en.wikipedia.org/wiki/Jackknife_resampling - the jackknife average is the same as the plain average and is never used in this role.

Moreover, I found the original implementation, ROUGE-1.5.5.pl. And, in the changelog, I see: […]
In the Perl implementation I see a very simple solution: […]

If there are no objections, my offer is to remove the function named `jackknife_avg` and implement it in a similar way.

I found a very suspicious place in the method `rouge_l_summary`. There is a loop `for ref_sent in ref_sent_list` over the sentences of the reference text, but `ref_sent` is not used in the inner loop. OK, I fixed it, but the unit tests fail because the final score is lower than the previous one.

@djokester and @Ayushk4, how did you calculate the values for the unit test?
Next, I found an inefficient implementation of the method `weighted_lcs`: it creates an array of arrays of Float64 instead of a single matrix of Int32. I also separated this method into two different ones to provide type stability; the initial `weighted_lcs` returns either Int64 or String, which is definitely another performance issue.

At the same time, all the methods `rouge_n`, `rouge_l_sentence`, and `rouge_l_summary` are type unstable and return a single score or an array of scores depending on arguments. We could annotate the return type as `Union{Float64, Vector{Float64}}`, but do we really need that? For now this is the only non-API-breaking way to fix it.

An alternative is to return an array and implement an external average/argmax function. Or return a named tuple with the single score, the array, and maybe separate recall, precision, and F values.
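On the named-tuple alternative: Google's rouge_scorer takes that route, returning a precision/recall/fmeasure triple per metric, which sidesteps the Union return type entirely. A minimal Python sketch of that shape (the helper names here are mine, for illustration):

```python
from collections import namedtuple

# A score triple, similar in spirit to rouge_scorer's Score tuple.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def fmeasure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def lcs_score(candidate, reference, lcs_len):
    """Build a Score from an LCS length and the two token counts."""
    p = lcs_len / len(candidate)
    r = lcs_len / len(reference)
    return Score(precision=p, recall=r, fmeasure=fmeasure(p, r))
```

Callers then pick the component they want (`score.fmeasure`, say) instead of the function deciding between a scalar and a vector.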
Any thoughts on this?