Add some symbols to punctuation in strip_punctuation #9

Open
remusao wants to merge 1 commit into master

@remusao remusao commented Nov 12, 2013

Hi,

Since " and ' are considered punctuation in English, I thought it would be a good idea to add these characters to the function strip_punctuation! in the preprocessing module. I don't know whether there is a reason for leaving them out of the regex, but I needed them in a project of mine, so here is a patch in case it is useful to others too.

Best,
Remusao

@johnmyleswhite
Collaborator

This is tricky. Unlike other punctuation, single quote marks often occur within tokens, so stripping them causes a lot of problems. We should see what other systems do.

@remusao
Author

remusao commented Nov 12, 2013

I agree. Why not let the user choose? Or simply strip ' and " at the beginning and end of the string instead of everywhere? That would preserve tokens containing these symbols. In my case I mainly wanted to avoid tokens like "toto
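One reading of this suggestion is to trim quote characters only at the edges of each token, so that apostrophes inside a word survive. The following is a minimal sketch of that idea, not the patch in this pull request: `strip_edge_quotes` and `EDGE_QUOTES` are hypothetical names, and splitting on whitespace is an assumption about what counts as a token.

```julia
# Hypothetical helper, not part of the package: removes " and ' only at the
# start and end of each whitespace-separated token, leaving inner quotes alone.
const EDGE_QUOTES = ['"', '\'']

function strip_edge_quotes(text::AbstractString)
    tokens = split(text)                                # split on whitespace
    cleaned = [strip(t, EDGE_QUOTES) for t in tokens]   # trim edges only
    return join(filter(!isempty, cleaned), " ")         # drop tokens that were only quotes
end

strip_edge_quotes("he said \"toto\" and don't 'stop'")
```

With this approach a token such as don't keeps its apostrophe, while "toto loses its leading quote. A token that genuinely ends in an apostrophe (for example a plural possessive) would lose it, which is one reason the behaviour might need to be optional.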

@johnmyleswhite
Collaborator

Let's see what R's tm and Python's NLTK do, then make a decision.

@karl-kurzke

Would it also be possible to add "[" and "]" to this regex?
I had some problems with the remove_words! function, because there were brackets like these in my corpus and the closing ] was missed.
But perhaps it would be cleaner to update the remove_words! function so that it escapes regex syntax in each word first.
Something like:

# Escape regex metacharacters so that `word` is matched literally
regexSigns = "[]{}*()"
for sign in regexSigns
    word = replace(word, sign => string('\\', sign))
end
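The same idea can be wrapped in a small function and checked against a bracketed word. This is only a sketch under assumptions: `escape_regex_chars` is a hypothetical name, not an existing function in the package, and the set of escaped characters is wider than the one proposed above.

```julia
# Hypothetical helper: prefixes every regex metacharacter in `word` with a
# backslash, so the result can be embedded in a pattern and matched literally.
function escape_regex_chars(word::AbstractString)
    return replace(word, r"[\\.\[\]{}()*+?^$|]" => m -> "\\" * m)
end

escaped = escape_regex_chars("a[1]")    # backslashes are added before [ and ]
pattern = Regex(escaped)

occursin(pattern, "x = a[1] + 2")       # the brackets are now matched literally
occursin(Regex("a[1]"), "x = a[1] + 2") # unescaped, [1] is a character class
```

Escaping the word before building the pattern keeps the punctuation regex untouched, so the two concerns (which characters count as punctuation, and how words are matched for removal) stay separate.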

3 participants