UnicodeEncodeError in split_sentences #41

Open
vetal4444 opened this issue Jan 13, 2015 · 11 comments

Comments

@vetal4444

  s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]

E UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 85: ordinal not in range(128)
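
For context, a minimal sketch of why this line can fail. On Python 2, calling `str()` on a `unicode` object encodes it with the default ASCII codec, so any non-ASCII character such as U+2014 raises `UnicodeEncodeError`. The snippet below makes that implicit step explicit so it behaves the same on Python 2 and Python 3. The sample text and the helper name `encode_ascii` are made up for illustration and do not come from pyteaser.

```python
# Illustration only: the sample text is invented, not taken from the issue.
text = u"PyTeaser \u2014 a summarizer"


def encode_ascii(value):
    """Mimic what Python 2's str(unicode_obj) does implicitly."""
    return value.encode("ascii")


try:
    encode_ascii(text)
    failed = False
except UnicodeEncodeError as exc:
    # The ASCII codec cannot represent U+2014.
    failed = True
    print("encode failed: %s" % exc.reason)

# Encoding with an explicit codec that covers the character succeeds.
utf8_bytes = text.encode("utf-8")
```

Keeping the text as unicode throughout, and encoding only at output boundaries, avoids the implicit ASCII step entirely.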

@vetal4444
Author

I'm using the latest version from pip (pyteaser==1.0).

@grimpunch

This was raised before and closed without being fixed. :(
#33

@xiaoxu193
Owner

@grimpunch Thought this was fixed with #34

Apparently not. Will look into this personally

@vetal4444
Author

It seems pip has an old version. The code on master does not have this error.

@grimpunch

My apologies, vetal4444 is correct. I'm using master now. Pip definitely has an old version

@xiaoxu193
Owner

@vetal4444 @grimpunch thank you guys for spotting the error!

@harikt

harikt commented Mar 23, 2015

I am still getting the same error. I tried encoding to UTF-8 and so on, but it is not working :(

@xiaoxu193
Owner

Can you post the link that you tried to run the algorithm on?

@xiaoxu193 xiaoxu193 reopened this Mar 23, 2015
@harikt

harikt commented Mar 23, 2015

Sorry that I didn't thank you for the awesome work you have done. Thank you, dude.

Coming back to the problem:

The strange thing is that I installed pyteaser via pip and updated it via pip install -U.

An earlier version used a pyteaser.py file that I had simply copied into my folder, and that worked. Only the pip installation is failing. I am also new to Python; my background is PHP.

>>> from goose import Goose
>>> from pyteaser import Summarize
>>> g = Goose()
>>> page_url = "http://nikic.github.com/2012/06/29/PHP-solves-problems-Oh-and-you-can-program-with-it-too.html"
>>> try:
...     page = g.extract(page_url)
...     description = page.cleaned_text.encode('utf-8')
...     title = page.title
...     summarylist = Summarize(title, description)            
... except:
...     # Exception
...     print "Error occured in summary"
...     raise
... 
Error occured in summary
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 85, in Summarize
    sentences = split_sentences(text)
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 209, in split_sentences
    s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
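
The traceback is consistent with a byte string reaching `map(unicode, y)`: the snippet encodes `cleaned_text` to UTF-8 before calling `Summarize`, and on Python 2 `unicode()` decodes bytes with the ASCII codec, which fails at the first non-ASCII byte (0xe2 is the lead byte of many UTF-8 punctuation characters). One possible workaround, sketched below, is to normalize the input to text before summarizing. The helper `to_text` is hypothetical and not part of pyteaser or goose, the sample string is invented, and whether this resolves this particular report is untested.

```python
def to_text(value, encoding="utf-8"):
    """Return text (unicode) for either bytes or text input.

    Hypothetical helper for illustration; not part of pyteaser.
    """
    if isinstance(value, bytes):
        # Decode explicitly instead of relying on the implicit ASCII codec.
        return value.decode(encoding)
    return value


# Bytes, comparable to what .encode('utf-8') produces in the session above.
raw = u"PHP \u2014 solves problems".encode("utf-8")
text = to_text(raw)  # back to unicode text, safe to hand to text-based code
```

Assuming `page.cleaned_text` is already a unicode string, this would amount to passing it to `Summarize` unchanged rather than calling `.encode('utf-8')` first.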

@harikt

harikt commented Mar 31, 2015

Hi @xiaoxu193 ,

I have a question: what about wrapping this in a try/except?

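    another = ''
    for y in s_iter:
        try:
            # Convert each piece to unicode and join the pieces together
            another += ''.join(map(unicode, y)).lstrip()
        except UnicodeError:
            # Placeholder: handle or log the chunk that could not be converted
            print "some way to catch"
    s_iter = [another]
    s_iter.append(sentences[-1])
    return s_iter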

This is only pseudocode, though, and it didn't work :(

Just a thought.

Thank you.

@harikt

harikt commented Apr 2, 2015

The problem occurs in the call split(u'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+\"?[A-ZА-ЯЁ])', text, maxsplit=0, flags=REGEX_UNICODE)
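
A minimal sketch of how a pattern like the one quoted above behaves when it is given unicode text. It assumes `REGEX_UNICODE` corresponds to `re.UNICODE`, and the function name `split_sentences_sketch` is hypothetical; this is not pyteaser's actual implementation. Because the pattern contains a capturing group, `split()` returns the sentence bodies interleaved with their closing punctuation, so the two are paired back up afterwards.

```python
import re

# Split on sentence-ending punctuation that is not preceded by a capital
# letter and is followed by whitespace plus a capital (Latin or Cyrillic).
SENTENCE_END = re.compile(
    r'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+"?[A-ZА-ЯЁ])', flags=re.UNICODE
)


def split_sentences_sketch(text):
    """Split unicode text into sentences, keeping closing punctuation."""
    parts = SENTENCE_END.split(text)
    # parts looks like [body, punct, body, punct, ..., tail].
    sentences = [
        (body + punct).lstrip()
        for body, punct in zip(parts[0::2], parts[1::2])
    ]
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_sentences_sketch(u"It works \u2014 mostly. Does it fail? No.")
```

The sketch only ever concatenates unicode strings, so no implicit ASCII conversion takes place. Passing a byte string instead of text is the case that needs guarding against.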
