ROUGE: fixed sentences calculation and some minor refactoring #272
Conversation
Added initial changes:

```perl
sub wlcsWeight {
    my $r     = shift;
    my $power = shift;
    return $r**$power;
}

sub wlcsWeightInverse {
    my $r     = shift;
    my $power = shift;
    return $r**( 1 / $power );
}
```

But Google's implementation doesn't use weights - https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L210

The method […]. One failed test I see with Julia 1.3 because I used […]. Any feedback?
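For reference, a direct Python rendering of the Perl pair above (a sketch only; the default `power` here is illustrative, not a value configured anywhere in ROUGE-1.5.5):

```python
def wlcs_weight(r, power=2.0):
    """f(r) = r**power, mirroring wlcsWeight in ROUGE-1.5.5.pl."""
    return r ** power

def wlcs_weight_inverse(r, power=2.0):
    """f^-1(r) = r**(1 / power), mirroring wlcsWeightInverse."""
    return r ** (1.0 / power)

# In the paper (Sec. 3.2) these normalize the weighted LCS score, e.g.
# WLCS recall is f^-1(WLCS(X, Y) / f(m)) for a reference of length m.
```

With the identity weight (`power=1`) both functions are no-ops, which matches Google's unweighted behavior.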
We don't need 1.3. Can do 1.6 minimum.
And, the last open question - do we need to have […]?
Added a minor change: the weighting function can now be changed via an argument if necessary. At least both cases work - Google's without weighting and the previous implementation with a fixed […].
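To illustrate the "weighting function as an argument" idea, here is a hedged sketch in Python (the names and the default are mine, not the PR's actual API): with the identity weight it reduces to the plain LCS length that Google's implementation uses, while `f(k) = k**2` reproduces the weighted behavior that favors consecutive matches.

```python
def weighted_lcs(xs, ys, f=lambda k: k * k):
    """Weighted LCS score with a pluggable weight function f.

    f must be strictly increasing with f(0) == 0.  f(k) = k yields the
    plain (unweighted) LCS length; f(k) = k**2 rewards consecutive matches.
    """
    m, n = len(xs), len(ys)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # accumulated weighted score
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of the current match run
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if xs[i - 1] == ys[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])  # run broken; w stays 0
    return c[m][n]
```

For example, `weighted_lcs(list("ABCD"), list("ABXD"), f=lambda k: k)` gives 3.0 (the plain LCS length), while the default `f=lambda k: k * k` gives 5.0, because the consecutive "AB" run earns f(2) = 4 instead of 2.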
I'm OK with that plan.
…ext#272) ROUGE:

* fixed sentences calculation and some minor refactoring
* changed output types, optimized, aligned with ROUGE-1.5.5 and checked with Google Research implementation
* docs updates
* rouge_l_sentence - added a weighting-function argument.
Hi, I started looking at the ROUGE-N/L evaluation components and found that the code was sometimes inefficient and sometimes looked buggy. This is a first commit with some improvements and fixes.
First of all, I opened the original article https://aclanthology.org/W04-1013.pdf and found an incorrect mention of the "jackknifing procedure". In the code https://github.com/JuliaText/TextAnalysis.jl/blob/master/src/utils.jl#L7 there is an implementation named `jackknife_avg`, but it is not the average. The implemented version looks like `argmax`, as literally mentioned in the paper, but has a wrong description. See https://en.wikipedia.org/wiki/Jackknife_resampling - the jackknife average is the same as the plain average and is never used in this role.

Moreover, I found the original implementation, ROUGE-1.5.5.pl. And, in the changelog, I see: […]
In the Perl implementation I see a very simple solution: […]

If there are no objections, my offer is to remove the function named `jackknife_avg` and implement it in a similar way.

I found a very suspicious place in the method `rouge_l_summary`. There is a loop `for ref_sent in ref_sent_list` over the sentences of the reference text, but `ref_sent` is not used in the inner loop. OK, I fixed it, but the unit tests fail because the final score is lower than the previous one.

@djokester and @Ayushk4, how did you calculate the values for the unit test?
Next, I found an inefficient implementation of the method `weighted_lcs`: it creates an array of arrays of Float64 instead of a single matrix of Int32. I also separated this method into two different ones to provide type stability; the initial `weighted_lcs` returns either Int64 or String, which is definitely another performance issue.

At the same time, all the methods `rouge_n`, `rouge_l_sentence`, and `rouge_l_summary` are type unstable and return a single score or an array of scores depending on arguments. We could annotate the return type as `Union{Float64, Vector{Float64}}`, but do we really need that? For now this is the only non-API-breaking way to fix it.

An alternative is to return an array and implement an external average/argmax function. Or return a named tuple with the single score, the array, and maybe separate recall, precision, and F values.
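On the named-tuple alternative: Google's rouge_scorer takes that route, returning a precision/recall/fmeasure triple per metric, which sidesteps the Union return type entirely. A minimal Python sketch of that shape (the helper names here are mine, for illustration):

```python
from collections import namedtuple

# A score triple, similar in spirit to rouge_scorer's Score tuple.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def fmeasure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def lcs_score(candidate, reference, lcs_len):
    """Build a Score from an LCS length and the two token counts."""
    p = lcs_len / len(candidate)
    r = lcs_len / len(reference)
    return Score(precision=p, recall=r, fmeasure=fmeasure(p, r))
```

Callers then pick the component they want (`score.fmeasure`, say) instead of the function deciding between a scalar and a vector.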
Any thoughts on this?