UnicodeEncodeError in split_sentences #41

vetal4444 · 2015-01-13T16:36:23Z

  s_iter = [''.join(map(str,y)).lstrip() for y in s_iter]
E UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 85: ordinal not in range(128)
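
For context, a minimal sketch of why this line can fail: in Python 2, calling str() on a unicode object encodes it with the default ASCII codec, and the em dash (U+2014) has no ASCII representation. The names below (to_ascii_bytes, text) are illustrative only and are not part of pyteaser; the explicit .encode("ascii") stands in for the implicit conversion.

```python
# -*- coding: utf-8 -*-
# Illustrative only: reproduces the failure mode, not pyteaser's own code.
text = u"before \u2014 after"  # contains an em dash (U+2014)


def to_ascii_bytes(value):
    """Mimic what Python 2's str() does to a unicode object: encode as ASCII."""
    return value.encode("ascii")


try:
    to_ascii_bytes(text)
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# Encoding with a codec that covers the character succeeds.
utf8_bytes = text.encode("utf-8")
```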

vetal4444 · 2015-01-13T16:40:34Z

I am using the latest version from pip (pyteaser==1.0).

grimpunch · 2015-02-19T10:41:30Z

This was raised before and closed without being fixed. :(
#33

xiaoxu193 · 2015-02-19T14:18:55Z

@grimpunch Thought this was fixed with #34

Apparently not. Will look into this personally

vetal4444 · 2015-02-19T14:31:09Z

It seems pip has an old version. The code on master does not have this error.

grimpunch · 2015-02-20T16:38:38Z

My apologies, vetal4444 is correct. I'm using master now. Pip definitely has an old version

xiaoxu193 · 2015-03-16T05:11:28Z

@vetal4444 @grimpunch thank you guys for spotting the error!

Pip has been updated: https://pypi.python.org/pypi/pyteaser
README has been updated to reflect the change (see "Updated README for v2.0 in PyPi", #43).

harikt · 2015-03-23T14:33:51Z

I am still getting the same error. I tried encoding to utf-8 etc., but it is not working :(

xiaoxu193 · 2015-03-23T14:43:18Z

Can you post the link that you tried to run the algorithm on?

harikt · 2015-03-23T15:40:34Z

Sorry that I didn't thank you for the awesome work you have done. Thank you dude.

Coming back to the problem:

The strange thing is that I installed pyteaser via pip and updated it via pip install -U.

An earlier version was using the pyteaser.py file, which I had just copied into my folder. That worked from there. Only the pip installation is failing. I am also new to Python; my background is PHP.

>>> from goose import Goose
>>> from pyteaser import Summarize
>>> g = Goose()
>>> page_url = "http://nikic.github.com/2012/06/29/PHP-solves-problems-Oh-and-you-can-program-with-it-too.html"
>>> try:
...     page = g.extract(page_url)
...     description = page.cleaned_text.encode('utf-8')
...     title = page.title
...     summarylist = Summarize(title, description)            
... except:
...     # Exception
...     print "Error occured in summary"
...     raise
... 
Error occured in summary
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 85, in Summarize
    sentences = split_sentences(text)
  File "/usr/local/lib/python2.7/dist-packages/pyteaser.py", line 209, in split_sentences
    s_iter = [''.join(map(unicode,y)).lstrip() for y in s_iter]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
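
Note that this traceback is a UnicodeDecodeError, not the UnicodeEncodeError from the title. That is consistent with the text reaching Summarize as UTF-8 bytes (because of the .encode('utf-8') call) and then being passed through unicode(), which in Python 2 decodes byte strings with the ASCII codec by default. A minimal sketch of keeping the text as unicode instead follows. ensure_text is a hypothetical helper, not part of pyteaser or goose, and whether this resolves the pip-installed version is not confirmed in this thread.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: normalise input to unicode text before summarizing.


def ensure_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes like those produced by page.cleaned_text.encode('utf-8').
raw = u"a \u2019 b".encode("utf-8")

try:
    raw.decode("ascii")  # what an implicit ASCII decode attempts
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

text = ensure_text(raw)  # explicit UTF-8 decode succeeds
```

In the session above, the equivalent change would be to drop the .encode('utf-8') call, or to wrap description in a helper like this before calling Summarize.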

harikt · 2015-03-31T15:44:32Z

Hi @xiaoxu193 ,

I have a question: what about wrapping it in a try/except?

    another = ''
    for y in s_iter:
        try:
            another += ''.join(map(unicode,y)).lstrip()
        except:
            print "some way to catch"
    s_iter = [another]
    s_iter.append(sentences[-1])
    return s_iter

This is pseudocode, though, and it didn't work :(
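
One alternative to catching the exception, sketched here under assumptions: decode each byte piece explicitly and let the codec substitute a replacement character for bytes it cannot decode, so nothing is silently dropped. join_pieces is a hypothetical name and is not pyteaser's API.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch: lenient joining of sentence pieces.


def join_pieces(pieces, encoding="utf-8"):
    """Join pieces into one unicode string, decoding byte pieces leniently."""
    out = []
    for piece in pieces:
        if isinstance(piece, bytes):
            # errors="replace" inserts U+FFFD for undecodable bytes
            # instead of raising UnicodeDecodeError.
            piece = piece.decode(encoding, errors="replace")
        out.append(piece)
    return u"".join(out).lstrip()


joined = join_pieces([b" caf\xc3\xa9", u"."])
lenient = join_pieces([b"bad \xff", u"!"])
```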

Just my thought.

Thank you.

harikt · 2015-04-02T10:24:22Z

The problem occurs in split(u'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\s+\"?[A-ZА-ЯЁ])', text, maxsplit=0, flags=REGEX_UNICODE)
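
As a rough illustration of how that split behaves when the input is already unicode, here is a sketch that reuses the same pattern. split_sentences_sketch and SENTENCE_END are hypothetical names, and the way the pieces are re-joined is an assumption, not necessarily what pyteaser does.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as quoted above, with the backslash escaped explicitly.
SENTENCE_END = re.compile(
    u'(?<![A-ZА-ЯЁ])([.!?]"?)(?=\\s+"?[A-ZА-ЯЁ])', re.UNICODE
)


def split_sentences_sketch(text):
    """Split text into sentences, decoding UTF-8 bytes first (assumption)."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # The capturing group makes split() return:
    # sentence, delimiter, sentence, delimiter, ..., tail
    pieces = SENTENCE_END.split(text)
    sentences = [
        (body + end).lstrip()
        for body, end in zip(pieces[0::2], pieces[1::2])
    ]
    sentences.append(pieces[-1].lstrip())
    return sentences


result = split_sentences_sketch(u"It\u2019s fine. Next one! Done")
```

Because the helper decodes bytes up front, the same call also accepts the UTF-8 encoded form of the text.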

xiaoxu193 closed this as completed Mar 16, 2015

xiaoxu193 reopened this Mar 23, 2015

harikt mentioned this issue Apr 2, 2015

Strings with ’ is one that breaks. May be we add those to get replaced ? #45

Closed
