improper stemming of NGram documents #149

Open
tanmaykm opened this issue May 3, 2019 · 4 comments
Labels
help wanted good for beginners

Comments

@tanmaykm
Contributor

tanmaykm commented May 3, 2019

Stemming a NGramDocument stems only the last word of each ngram. Notice below how repository is stemmed to repositori in one place but left intact in another.

julia> td = NGramDocument("this repository of julia language", 3)
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("language"=>1,"repository"=>1,"this"=>1,"this repository of"=>1,"of julia language"=>1,"julia language"=>1,"of"=>1,"julia"=>1,"this repository"=>1,"repository of"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(td); td
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("languag"=>1,"this"=>1,"this repository of"=>1,"of julia languag"=>1,"this repositori"=>1,"of"=>1,"julia"=>1,"repositori"=>1,"repository of"=>1,"of julia"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

By contrast, stemming a StringDocument stems each word:

julia> sd = StringDocument("this repository of julia language")
StringDocument{String}("this repository of julia language", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(sd); sd
StringDocument{String}("this repositori of julia languag", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
@aviks aviks added the help wanted good for beginners label May 3, 2019
zgornel added a commit to zgornel/StringAnalysis.jl that referenced this issue May 7, 2019
@sean-gauss

Is work still needed on this issue? @aviks

@bnriiitb

@aviks Is this issue fixed, or is help still needed?

@sean-gauss

I intended to finish this; however, I am a bit busy with my internship at the moment. If you can resolve this issue, feel free to proceed.

@mostol

mostol commented Feb 16, 2022

@aviks Hi! I think I figured out what's going on here. It comes down to the stem function in line 38 of stemmer.jl below, which stems the n-gram (token), resulting in its stemmed version (new_token):

function stem!(stemmer::Stemmer, d::NGramDocument)
    for token in keys(d.ngrams)
        new_token = stem(stemmer, token)
        if new_token != token
            if haskey(d.ngrams, new_token)
                d.ngrams[new_token] = d.ngrams[new_token] + d.ngrams[token]
            else
                d.ngrams[new_token] = d.ngrams[token]
            end
            delete!(d.ngrams, token)
        end
    end
end

The problem is that token (the n-gram) is actually stored as a plain string. The name "token" is a bit of a misnomer: each n-gram is really a sequence of tokens that we want stemmed. So we either want to treat it like a StringDocument and stem each word in the string, or treat it like a TokenDocument and stem each token of the n-gram individually. Right now the n-gram is stemmed as a single String, so it is interpreted as one entity whose ending gets stemmed, rather than as a list of n entities to be stemmed individually.
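To make the difference concrete, here is a minimal sketch that does not use TextAnalysis at all. The toy_stem and stem_each_word helpers are hypothetical names invented for illustration, and toy_stem is only a crude suffix rule that happens to mimic the outputs shown in this issue, not the real stemmer:

```julia
# Hypothetical stand-in for a real stemmer: it rewrites a trailing "y" to "i"
# and drops a trailing "e". This is NOT the algorithm TextAnalysis uses; it
# only reproduces the outputs seen in this issue.
function toy_stem(word::AbstractString)
    if endswith(word, "y")
        return chop(word) * "i"
    elseif endswith(word, "e")
        return String(chop(word))
    end
    return String(word)
end

# Stem every whitespace-separated word of an n-gram, then rejoin with spaces.
stem_each_word(ngram::AbstractString) = join(map(toy_stem, split(ngram)), " ")

ngram = "this repository of"
whole = toy_stem(ngram)        # n-gram treated as one word: only its ending is examined
each  = stem_each_word(ngram)  # n-gram treated as a sequence of words

println(whole)
println(each)
```

With this toy rule, stemming the whole string leaves "this repository of" untouched because the string ends in "of", while the word-by-word version turns the middle word into "repositori".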

This might mean fundamentally altering the nature of NGramDocuments to be made up of either StringDocuments or vectors of strings like TokenDocuments are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!

(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change

new_token = stem(stemmer, token)

to

new_token = stem_all(stemmer, token)

and be done with it, which is also an option...)
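For anyone picking this up, a rough sketch of what a per-word fix could look like on a plain dictionary of n-gram counts follows. It assumes the n-grams are space-separated strings mapped to integer counts, as in the output above. The stem_ngram_counts function and the toy_stem rule are hypothetical, not part of TextAnalysis. The sketch builds a fresh dictionary, so stemmed keys that collide have their counts merged, and nothing is inserted into or deleted from the dictionary being iterated (which the quoted loop does):

```julia
# Hypothetical toy stemmer used only to keep this sketch self-contained:
# it rewrites a trailing "y" to "i" and leaves everything else alone.
toy_stem(w::AbstractString) = endswith(w, "y") ? chop(w) * "i" : String(w)

# Return a new Dict whose keys are the n-grams with every word stemmed.
# `stem_word` is any function mapping one word to its stem.
function stem_ngram_counts(stem_word, ngrams::AbstractDict{<:AbstractString,<:Integer})
    stemmed = Dict{String,Int}()
    for (ngram, count) in ngrams
        # Stem each word of the n-gram individually, then rejoin.
        key = join((stem_word(w) for w in split(ngram)), " ")
        # Merge counts when two n-grams stem to the same key.
        stemmed[key] = get(stemmed, key, 0) + count
    end
    return stemmed
end

counts = Dict("repository" => 1, "repositori" => 2, "this repository" => 1)
result = stem_ngram_counts(toy_stem, counts)
```

Swapping toy_stem for a closure around the package's real stemmer would be the obvious next step, but that wiring is left out here because it depends on how NGramDocument ends up being restructured.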
