invalid continuation byte error #4

Open
conniec opened this issue Feb 8, 2012 · 2 comments

Comments

@conniec

conniec commented Feb 8, 2012

I'm getting this error when trying to run urlnorm.norm on this URL:

http://productiveRamadan.com/ar/%d8%a7%d9%86%d8%aa%d9%81%d8%b9-%d9%85%d9%86-%d8%a7%d9%84%d8%b5%d9%88%d9%85-%d9%88-%d8%aa%d8%ac%d9%86%d8%a8-%d9%87%d8%b0%d9%87-%d8%a7%d9%84%d8%a3%d9%86%d9%88%d8%a7%d8%b9-%d9%85%d9%86-%d8%a7%d9%84%d8%-2

The URL is a valid Arabic URL, but something in norm_path() causes value.decode("utf-8") in _unicode() to fail with:
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 74: invalid continuation byte"

Printing out the value in _unicode() and then running decode('utf-8') on it in a Python shell works fine. Any ideas?

Thanks in advance.

@miracle2k

I'm running into this issue as well. The percent-encoded bytes in this URL simply aren't valid UTF-8: %d8 is a lead byte that expects a continuation byte, and the %-2 that follows isn't one.
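To make the failure concrete, here is a minimal sketch that reproduces it outside urlnorm. It assumes Python 3 and the standard library's `urllib.parse.unquote_to_bytes` (the original traceback is from Python 2), and it uses only a shortened tail of the reported path, so the byte position differs from the one in the report.

```python
from urllib.parse import unquote_to_bytes

# Shortened tail of the reported path (an assumption made for brevity).
# The final "%d8%-2" is the problematic part.
tail = "%d8%a7%d9%84%d8%-2"

# "%-2" is not a valid percent escape, so it stays as the literal bytes b"%-2".
raw = unquote_to_bytes(tail)

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xd8 starts a two-byte sequence, but the next byte is "%" (0x25),
    # which is not a continuation byte.
    error = exc

print(raw)
print(error)
```

The first two pairs decode cleanly; only the trailing 0xd8 has nothing valid after it, which matches the "invalid continuation byte" message.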

I've looked at Chrome and get the impression that it doesn't bother with anything but the standard 7-bit ASCII characters. It converts those to the real character, for the benefit of both the user and the server, percent-encodes everything else, and leaves existing percent-encoding as-is, letting the server deal with broken encodings like this one. The server will probably either fail with a BadRequest, like productiveramadan.com above, or ignore the invalid byte and probably end up with a 404.

Python's urllib unquote() (urllib.parse.unquote() in Python 3) simply decodes with errors='replace', which would be an easy solution for this library as well. Still, it strikes me as slightly more correct to leave the invalid escape sequence as is, since U+FFFD, REPLACEMENT CHARACTER � would presumably be valid on its own in a URL. That is, a server might not treat "http://example.org/�" the same as "http://example.org/%96", even though we'd be normalizing the latter to the former.
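The two options can be compared side by side with a rough sketch. It assumes Python 3; the `keep-percent` error handler and its name are hypothetical, written only for illustration, and are not part of urlnorm or necessarily what the fix branch does. The handler re-emits undecodable bytes as uppercase percent escapes, so the original lowercase `%d8` comes back as `%D8`.

```python
import codecs
from urllib.parse import unquote


def _keep_percent(exc):
    """Hypothetical handler: re-emit undecodable bytes as percent escapes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return "".join("%{:02X}".format(b) for b in bad), exc.end


codecs.register_error("keep-percent", _keep_percent)

# Shortened tail of the reported path (an assumption made for brevity).
tail = "%d8%a7%d9%84%d8%-2"

# Option 1: the stdlib default, which substitutes U+FFFD for the bad byte.
replaced = unquote(tail, errors="replace")

# Option 2: keep the invalid escape visible instead of replacing it.
kept = unquote(tail, errors="keep-percent")

print(repr(replaced))
print(repr(kept))
```

With the second option the valid pairs are still decoded to Arabic characters, while the stray byte stays percent-encoded, so the normalized URL does not silently turn into one containing U+FFFD.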

I'm not a fan of this library doing URL validation in the first place, though, so I don't think it should fail here either.

miracle2k added a commit to miracle2k/urlnorm that referenced this issue Oct 20, 2013
@miracle2k

I've fixed this in https://github.com/miracle2k/urlnorm/tree/4-unquote, though it's based on my Python 3 branch.

msnoigrs pushed a commit to msnoigrs/urlnorm that referenced this issue Sep 17, 2015