Sentence splitting of sentences without whitespace after period #38

Open
oxinabox opened this issue Oct 11, 2019 · 2 comments

Comments

@oxinabox
Member

julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
" This is a sentence.Laugh Out Loud."
"Keep coding."
"No."
"Yes!"
"True!"
"ohh!ya!"
"me too."
I observed that text with no space after a sentence delimiter (admittedly grammatically incorrect) is not split into two separate sentences, for example "This is a sentence.Laugh Out Loud." and "ohh!ya!". Can this be considered an issue?

Originally posted by @RohitPingale in #32 (comment)

@RohitPingale

RohitPingale commented Oct 14, 2019

>>> from nltk.tokenize import sent_tokenize
>>> text = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "
>>> sent_tokenize(text)
[' This is a sentence.Laugh Out Loud.', 'Keep coding.', 'No.', 'Yes!', 'True!', 'ohh!ya!', 'me too.']
I tried the same example in Python with NLTK and it gives the same output. Should we treat that as the benchmark, or do we have to split those sentences anyway?
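
If the decision is to split such sentences anyway, one possible approach is a regex pass that also treats a terminator followed directly by a letter as a sentence boundary. The sketch below is in Python to match the NLTK comparison above. `split_sentences_loose` is a hypothetical helper, not part of WordTokenizers.jl or NLTK, and it assumes that `.`, `!` and `?` are the only terminators. It relies on `re.split` splitting on empty matches, which needs Python 3.7 or later.

```python
import re

# Hypothetical helper for illustration only; not an API of
# WordTokenizers.jl or NLTK.
# A boundary is a position right after '.', '!' or '?' that is followed
# either by whitespace (consumed) or directly by an ASCII letter
# (zero-width match, nothing consumed).
_BOUNDARY = re.compile(r"(?<=[.!?])(?:\s+|(?=[A-Za-z]))")


def split_sentences_loose(text):
    """Split text into sentences, even when no space follows the terminator."""
    parts = _BOUNDARY.split(text.strip())
    # An empty input yields [''], so drop empty pieces.
    return [part for part in parts if part]


if __name__ == "__main__":
    example = " This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. "
    for sentence in split_sentences_loose(example):
        print(repr(sentence))
```

On the example from this issue, this yields nine sentences instead of seven, with "This is a sentence." / "Laugh Out Loud." and "ohh!" / "ya!" separated. The trade-off is that abbreviations such as "e.g." would also be split, which may be why both tokenizers require whitespace; a digit after the period (as in "3.14") is not treated as a boundary.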

@oxinabox
Member Author

@ninjin thoughts?
