forked from Aloisius/ia-web-commons
WaybackURLKeyMaker to keep non-utf8 percent encodings #6
sebastian-nagel added a commit that referenced this issue on Dec 15, 2016
sebastian-nagel added a commit that referenced this issue on Dec 15, 2016
This is difficult to solve: Python (2.7) and Java have different string types, based on bytes and Unicode characters, respectively. The "surt" module used with Python 3 causes a similar problem (internetarchive/surt#19).
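To make the bytes-versus-characters point concrete, here is a minimal sketch using only Python 3's standard `urllib.parse`. It is not taken from surt or ia-web-commons; it only illustrates, under that assumption, how a byte-based round trip preserves non-UTF-8 percent encodings while a character-based round trip cannot. The sample value is the query value from one of the URLs reported in this issue.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Percent-encoded bytes that are not valid UTF-8 (query value from the issue).
encoded = "%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5"

# Byte-oriented handling: the raw bytes survive decoding and re-encoding.
as_bytes = unquote_to_bytes(encoded)
bytes_round_trip = quote(as_bytes)

# Character-oriented handling: unquote() decodes as UTF-8 with
# errors='replace' by default, so invalid bytes turn into U+FFFD
# and the original byte values are lost.
as_text = unquote(encoded)
text_round_trip = quote(as_text)

print(bytes_round_trip == encoded)
print(text_round_trip == encoded)
```

A key maker that works on Unicode strings therefore needs an explicit rule for byte sequences that do not decode, which is where implementations can diverge.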
sebastian-nagel added a commit that referenced this issue on Jan 24, 2017
This was referenced Aug 27, 2023
WaybackURLKeyMaker.makeKey(url) replaces percent signs with %25 in percent-encoded URLs whose bytes do not represent valid UTF-8 encoded characters (as in URLs predating RFC 3986):

http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5

Python's surt module behaves differently, which breaks look-up in CDX files for such URLs.
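
The symptom can be reproduced in a few lines. The sketch below is an illustration only, not the WaybackURLKeyMaker or surt code: the helper names `decodes_as_utf8` and `escape_percent_signs` are hypothetical. It checks whether a percent-encoded string decodes to valid UTF-8 and mimics the reported output by lowercasing the string and turning each `%` into `%25`.

```python
from urllib.parse import unquote_to_bytes


def decodes_as_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded bytes form valid UTF-8."""
    try:
        unquote_to_bytes(percent_encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def escape_percent_signs(text: str) -> str:
    """Mimic the reported symptom: lowercase, then '%' becomes '%25'."""
    return text.lower().replace("%", "%25")


# Path segment from the first URL in this issue (not valid UTF-8).
legacy = "%C3%CE%CA%C7%D1%E5%C7"
# A valid UTF-8 sequence for comparison (the letter "é").
modern = "%C3%A9"

print(decodes_as_utf8(legacy))       # False
print(decodes_as_utf8(modern))       # True
print(escape_percent_signs(legacy))  # %25c3%25ce%25ca%25c7%25d1%25e5%25c7
```

The last line matches the path segment of the first key shown above, which suggests the percent signs are escaped when the bytes fail UTF-8 validation.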