invalid continuation byte error #4

conniec · 2012-02-08T20:54:27Z

I'm getting this error when trying to run urlnorm.norm on this url:

http://productiveRamadan.com/ar/%d8%a7%d9%86%d8%aa%d9%81%d8%b9-%d9%85%d9%86-%d8%a7%d9%84%d8%b5%d9%88%d9%85-%d9%88-%d8%aa%d8%ac%d9%86%d8%a8-%d9%87%d8%b0%d9%87-%d8%a7%d9%84%d8%a3%d9%86%d9%88%d8%a7%d8%b9-%d9%85%d9%86-%d8%a7%d9%84%d8%-2

The URL is a valid Arabic URL, but something in norm_path() causes value.decode("utf-8") in _unicode() to fail with:
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 74: invalid continuation byte"

Printing the value inside _unicode() and then calling decode('utf-8') on it in a Python shell works fine. Any ideas?

Thanks in advance.

miracle2k · 2013-10-20T14:53:20Z

I'm running into this issue as well. The percent-encoded bytes in this URL simply aren't valid UTF-8: the final %d8 expects a continuation byte, and %-2 isn't one.
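To make the failure concrete, here is a minimal Python 3 sketch that unquotes the path of the URL from the report and tries to decode it. It uses only the standard library, not urlnorm itself, so it approximates what norm_path() runs into rather than reproducing its exact code path; the names `PATH`, `raw` and `error` are made up for this example.

```python
from urllib.parse import unquote_to_bytes

# Path component of the URL from the original report.
PATH = (
    "/ar/%d8%a7%d9%86%d8%aa%d9%81%d8%b9-%d9%85%d9%86-"
    "%d8%a7%d9%84%d8%b5%d9%88%d9%85-%d9%88-%d8%aa%d8%ac%d9%86%d8%a8-"
    "%d9%87%d8%b0%d9%87-%d8%a7%d9%84%d8%a3%d9%86%d9%88%d8%a7%d8%b9-"
    "%d9%85%d9%86-%d8%a7%d9%84%d8%-2"
)

# unquote_to_bytes() leaves malformed escapes such as "%-2" untouched,
# so the unquoted path ends in a lone 0xd8 followed by a literal "%".
raw = unquote_to_bytes(PATH)

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # Report where strict decoding gave up and which byte caused it.
    print(error.start, hex(raw[error.start]), error.reason)
```

The offending byte is the trailing %d8: it announces a two-byte sequence, but the next byte is the "%" of the malformed "%-2".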

I've looked at Chrome and get the impression that it doesn't bother with anything but the standard 7-bit ASCII characters. Those it converts to the real character, for the benefit of both the user and the server; everything else it percent-encodes, and existing percent-encoding it leaves as-is, letting the server deal with broken encodings like this one. The server will probably either fail with a Bad Request, as productiveramadan.com does for the URL above, or ignore the invalid byte and most likely end up with a 404.

Python 3's urllib.parse.unquote() simply decodes with errors='replace', which would be an easy solution for this library as well. It strikes me as slightly more correct, though, to leave the invalid escape sequence as-is: U+FFFD REPLACEMENT CHARACTER (�) would presumably be valid on its own in a URL, so a server might not treat "http://example.org/�" the same as "http://example.org/%96", even though we'd be normalizing the latter to the former.
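The two options can be compared side by side. The sketch below is Python 3 and standard library only; `unquote_keep_invalid` is a hypothetical helper written for this illustration, not the code in the 4-unquote branch and not part of urlnorm. It decodes each run of percent escapes with the `surrogateescape` error handler and turns every undecodable byte back into a %XX escape. As a simplification it also unescapes reserved characters such as %2F, which a real normalizer would probably want to leave alone, and it upper-cases the hex digits of the escapes it restores.

```python
import re
from urllib.parse import unquote, unquote_to_bytes

# One or more consecutive, well-formed percent escapes.
ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def unquote_keep_invalid(text):
    """Unquote valid UTF-8 escapes; keep undecodable bytes as %XX (hypothetical helper)."""

    def decode_run(match):
        raw = unquote_to_bytes(match.group(0))
        # surrogateescape maps each undecodable byte 0xNN to U+DCNN.
        decoded = raw.decode("utf-8", errors="surrogateescape")
        # Convert those lone surrogates back into percent escapes.
        return "".join(
            "%{:02X}".format(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
            for ch in decoded
        )

    return ESCAPE_RUN.sub(decode_run, text)


# Option 1: the standard library default replaces bad bytes with U+FFFD.
print(ascii(unquote("/%96")))

# Option 2: bad bytes stay percent-encoded, valid sequences are decoded.
print(ascii(unquote_keep_invalid("/%96")))
print(ascii(unquote_keep_invalid("/%d8%a7%d8%-2")))
```

With the second option "/%96" survives unchanged, so it can never collide with a URL that contains a literal U+FFFD.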

I'm not a fan of this library doing url validation in the first place though, so I don't think it should fail here either.

miracle2k · 2013-10-20T19:08:17Z

I've fixed this in https://github.com/miracle2k/urlnorm/tree/4-unquote, based on my Python 3 branch though.


miracle2k added a commit to miracle2k/urlnorm that referenced this issue Oct 20, 2013

Fix unquoting invalid encodings. Close jehiah#4.

8eb0a76

msnoigrs pushed a commit to msnoigrs/urlnorm that referenced this issue Sep 17, 2015

Merge pull request jehiah#4 from jyang15/fix_version

feabdb4

Update __version__
