WaybackURLKeyMaker to keep non-utf8 percent encodings #6

Open
sebastian-nagel opened this issue Dec 15, 2016 · 1 comment · May be fixed by #28
Comments

@sebastian-nagel

WaybackURLKeyMaker.makeKey(url) replaces each percent sign with %25 when a percent-encoded URL contains bytes that do not represent valid UTF-8 encoded characters (URL encoding as practiced before RFC 3986):

http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5

Python's surt module behaves differently, which breaks lookups in CDX files for such URLs.
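
To make the failure condition concrete, here is a minimal Python 3 sketch that checks whether the percent-encoded bytes of a URL part are valid UTF-8, and that shows one byte-preserving way to keep the escapes instead of double-escaping them. The helper names `escapes_are_valid_utf8` and `keep_escapes` are hypothetical, and lowercasing the result is an assumption made only to mirror the keys quoted above; this is not the code of WaybackURLKeyMaker or of surt.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def escapes_are_valid_utf8(url_part: str) -> bool:
    """Return True if the percent-decoded bytes form valid UTF-8."""
    raw = unquote_to_bytes(url_part)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def keep_escapes(url_part: str) -> str:
    """Hypothetical byte-preserving normalization (an assumption, not
    the library's method): decode to raw bytes, percent-encode every
    byte outside the safe set again, then lowercase the whole string."""
    raw = unquote_to_bytes(url_part)
    return quote_from_bytes(raw, safe="/").lower()


path = "/tags/%C3%CE%CA%C7%D1%E5%C7.htm"
print(escapes_are_valid_utf8(path))  # the bytes are not valid UTF-8
print(keep_escapes(path))            # escapes survive, no %25 introduced
```

Working on bytes rather than decoded characters means the original escapes never pass through a Unicode string, so nothing has to be replaced or escaped a second time.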

@sebastian-nagel
Author

Difficult to solve: Python 2.7 and Java have different string types, based on bytes and on Unicode characters, respectively. The "surt" module used with Python 3 causes a similar problem (internetarchive/surt#19).
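
The bytes-versus-characters difference can be sketched in Python 3, whose `str` type is Unicode-based. The snippet below only illustrates why a round trip through a Unicode string cannot reproduce non-UTF-8 escapes; it uses the standard `urllib.parse` functions and is not taken from surt or from the Java code.

```python
from urllib.parse import quote, unquote

original = "%C3%CE"

# unquote() decodes to a Unicode str; by default it uses UTF-8 with
# errors="replace", so invalid bytes become U+FFFD replacement characters.
decoded = unquote(original)

# Re-encoding the str yields the UTF-8 bytes of the replacement
# characters, not the original bytes 0xC3 0xCE.
requoted = quote(decoded)

print(repr(decoded), requoted)
```

Once the invalid bytes have been replaced, the original escapes cannot be recovered, which is why any fix has to either operate on bytes or pass undecodable escapes through untouched.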
