bug in function embedding from sentiment.jl #129

Open
mboedigh opened this issue Feb 28, 2019 · 3 comments

The following code seems to have a bug: it reshapes a matrix in an apparent attempt to transpose it, but reshape only changes the dimensions while preserving Julia's column-major element order, so the word vectors end up scrambled:

function embedding(embedding_matrix, x)
    # inefficient: grows the matrix one column at a time
    temp = embedding_matrix[:, Int64(x[1]) + 1]
    for i = 2:length(x)
        temp = hcat(temp, embedding_matrix[:, Int64(x[i]) + 1])
    end
    # bug: reshape keeps column-major element order, so this is not a transpose
    return reshape(temp, reverse(size(temp)))
end
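
To see why the reshape is not a transpose, here is a minimal standalone demonstration on a toy matrix (no TextAnalysis required):

M = [1 2 3; 4 5 6]            # 2×3; think of each column as one word's vector
reshape(M, reverse(size(M)))  # 3×2, refilled column-major: [1 5; 4 3; 2 6]
M'                            # 3×2 true transpose: [1 4; 2 5; 3 6]

After the reshape, the word vectors are interleaved instead of becoming rows.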

I propose the following:

function embedding(embedding_matrix, x)
    # select all columns at once, then transpose so each row is one word's vector
    return embedding_matrix[:, Int64.(x) .+ 1]'
end
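
As a quick check that the one-liner really transposes while the original scrambles, here is a toy comparison (embedding_old and embedding_fixed are just local names I'm using for the two versions above):

E = [1.0 2.0 3.0 4.0;
     5.0 6.0 7.0 8.0;
     9.0 10.0 11.0 12.0]   # one 3-dim vector per column, vocabulary of 4
x = [0, 2]                 # zero-based word ids, as sentiment.jl uses them

embedding_old(E, x) = reshape(hcat((E[:, Int64(xi) + 1] for xi in x)...), reverse((size(E, 1), length(x))))
embedding_fixed(E, x) = E[:, Int64.(x) .+ 1]'

embedding_old(E, x)    # [1.0 9.0 7.0; 5.0 3.0 11.0] -- vectors scrambled
embedding_fixed(E, x)  # [1.0 5.0 9.0; 3.0 7.0 11.0] -- rows are the word vectors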
After this change the results of the following are much improved:

using TextAnalysis
d_good = StringDocument("A very nice thing that everyone likes")
prepare!(d_good, strip_case | strip_punctuation)
d_bad = StringDocument("A boring long dull movie that no one likes")
prepare!(d_bad, strip_case | strip_punctuation)
s = SentimentAnalyzer()
s(d_good)
s(d_bad)
aviks (Member) commented Mar 1, 2019

Good catch, thanks.

aviks (Member) commented Mar 2, 2019

On further consideration, this change causes the existing test to fail:

d = StringDocument("a horrible thing that everyone hates")

Investigating...

mboedigh (Author) commented Mar 4, 2019

I believe there is a deeper problem in the word embedding. Bad words such as "hate" get good scores.

using TextAnalysis
d_bad = StringDocument("a horrible thing that everyone hates")
prepare!(d_bad, strip_case | strip_punctuation)
s = SentimentAnalyzer()
words = tokens(d_bad)                    # score each word on its own
[s(StringDocument(w)) for w in words]

I note that there are 88587 words in the dictionary:

length(s.model.words)    # 88587

but the lookup table has only 5000 entries:

size(s.model.weight[:embedding_1]["embedding_1"]["embeddings:0"], 2)    # 5000

Invoking s(d_illegal), where d_illegal is any StringDocument containing a word whose index maps above 5000, will cause an error. I don't know exactly where the weights come from, so I can't track it down further.
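
Until the origin of the weights is sorted out, a guard along these lines might avoid the crash. This is only a sketch: it assumes s.model.words acts as a word => index mapping (if it is a plain word list, build the mapping by enumerating it first), and safe_sentiment is a hypothetical helper name:

using TextAnalysis

function safe_sentiment(s::SentimentAnalyzer, d::StringDocument)
    # width of the embedding lookup table (5000 in the shipped model)
    table_width = size(s.model.weight[:embedding_1]["embedding_1"]["embeddings:0"], 2)
    # keep only tokens whose id fits inside the table; unknown words are dropped too
    kept = [w for w in tokens(d) if get(s.model.words, w, table_width + 1) <= table_width]
    return s(StringDocument(join(kept, " ")))
end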
