SURT URL canonicalization to handle non-UTF-8 percent-encoded characters #102

Merged
merged 3 commits into from
Dec 14, 2024

sebastian-nagel
Contributor

WaybackURLKeyMaker.makeKey(url) replaces percent signs with %25 in percent-encoded URLs whose bytes do not represent valid UTF-8 encoded characters (before RFC 3986):

http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm
-> com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm
https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5
-> ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5

Python's surt module behaves differently, which breaks look-ups in CDX files for such URLs:

$> pip3 show surt
Name: surt
Version: 0.3.1
Summary: Sort-friendly URI Reordering Transform (SURT) python package.
...

$> python3
>>> from surt import surt
>>> surt("http://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5")
'ua,1kr)/newslist.html?tag=%e4%ee%f8%ea%ee%eb%fc%ed%ee%e5'
>>> surt("https://www.insbase.ac/xoops2/modules/xpwiki/?%A4%D5%A4%AF%A4%AA%A4%AB%B8%A9%A4%AA%A4%AA%A4%CE%A4%B8%A4%E7%A4%A6%BB%D4")
'ac,insbase)/xoops2/modules/xpwiki?%a4%d5%a4%af%a4%aa%a4%ab%b8%a9%a4%aa%a4%aa%a4%ce%a4%b8%a4%e7%a4%a6%bb%d4'
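To illustrate why differing keys break look-ups, here is a minimal sketch. It assumes a CDX index can be modelled as a sorted list of SURT keys searched by binary search; real CDX lines carry further fields after the key, and the `lookup` helper is hypothetical. The keys themselves are the ones quoted above.

```python
from bisect import bisect_left

# Keys as written by the Java tools (percent signs re-escaped as %25).
java_index = sorted([
    "com,aluroba)/tags/%25c3%25ce%25ca%25c7%25d1%25e5%25c7.htm",
    "ua,1kr)/newslist.html?tag=%25e4%25ee%25f8%25ea%25ee%25eb%25fc%25ed%25ee%25e5",
])


def lookup(index, key):
    """Binary search for an exact key in a sorted list (hypothetical helper)."""
    i = bisect_left(index, key)
    return i < len(index) and index[i] == key


# Key produced by Python's surt for the same URL (bytes kept as-is).
python_key = "ua,1kr)/newslist.html?tag=%e4%ee%f8%ea%ee%eb%fc%ed%ee%e5"

found_with_python_key = lookup(java_index, python_key)
found_with_java_key = lookup(java_index, java_index[1])
```

A reader that computes the key one way cannot find a record indexed the other way, even though both keys describe the same URL.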

Notes:

@tfmorris
Contributor

tfmorris commented Dec 5, 2024

Thanks @sebastian-nagel. I'm not sure if it's important, but note that the issue reference in f7be47b resolves incorrectly in this new context. It's actually a reference to commoncrawl#6

@ato
Member

ato commented Dec 5, 2024

That's unfortunate.

| Implementation | "%C3" | "%C3%23" | "%C3%80" |
|---|---|---|---|
| jwarc | %25c3 | %25c3%23 | %c3%80 |
| OutbackCDX | %25c3 | %25c3%23 | %c3%80 |
| urlcanon (java) | %ef%bf%bd | %ef%bf%bd%23 | %c3%80 |
| urlcanon (python) | %c3 | %c3%23 | %c3%80 |
| surt (python) | %c3 | %c3%23 | %c3%80 |
| warcio.js | %c3 | %c3%23 | %c3%80 |
| webarchive-commons | %25c3 | %25c3%23 | %c3%80 |

Technically, this library is the reference implementation that the Python surt module is supposed to be a port of, and this change is breaking for existing CDX files generated by the original Java tools.

On the other hand, as OpenWayback is no longer updated and many organisations are moving to pywb, it may indeed pragmatically be better to follow the Python and JavaScript implementations.
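The three output shapes in the comparison table can be reproduced with a short Python sketch. This is only an illustration of the observed outputs for the three test inputs; the function names are hypothetical, and the code does not claim to mirror how any of the listed implementations work internally.

```python
import re
from urllib.parse import quote, unquote_to_bytes

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def lower_escapes(s):
    """Lowercase the hex digits of percent-escapes, leaving other text alone."""
    return _ESCAPE.sub(lambda m: m.group(0).lower(), s)


def preserve_bytes(s):
    """Keep the escaped bytes untouched (shape of the Python/JavaScript rows)."""
    return lower_escapes(s)


def replace_invalid(s):
    """Substitute U+FFFD for invalid UTF-8 (shape of the urlcanon Java row)."""
    text = unquote_to_bytes(s).decode("utf-8", errors="replace")
    return lower_escapes(quote(text, safe=""))


def escape_percent(s):
    """Emit %25xx for bytes that are not valid UTF-8 (shape of the jwarc,
    OutbackCDX and webarchive-commons rows)."""
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
    text = unquote_to_bytes(s).decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("%25{:02x}".format(cp - 0xDC00))
        else:
            out.append(quote(ch, safe=""))
    return lower_escapes("".join(out))


inputs = ["%C3", "%C3%23", "%C3%80"]
rows = {
    fn.__name__: [fn(s) for s in inputs]
    for fn in (escape_percent, replace_invalid, preserve_bytes)
}
```

The third input, `%C3%80`, is a valid UTF-8 sequence, which is why all three strategies agree on it. The first two contain a lone `0xC3` lead byte, and that is where the strategies diverge.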

@sebastian-nagel
Contributor Author

Numbers from Nov. 2024: in a sample of 10 million URLs, about 4,000 (0.04%) percent-encode non-ASCII characters in an encoding other than UTF-8. JP and RU are frequent top-level domains for such URLs, but they are found practically everywhere (83 different TLDs in the sample).
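For readers who want to take a similar measurement on their own data, the following is a rough sketch of one way to flag such URLs and tally them by top-level domain. It is an assumption about method, not a description of how the numbers above were obtained; `has_non_utf8_escapes` and `tld_of` are hypothetical helpers, and the `example.com` URL is made up as a valid UTF-8 control.

```python
from collections import Counter
from urllib.parse import unquote_to_bytes, urlsplit


def has_non_utf8_escapes(url):
    """Return True if the percent-decoded path and query are not valid UTF-8."""
    parts = urlsplit(url)
    raw = unquote_to_bytes(parts.path + "?" + parts.query)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def tld_of(url):
    """Last dot-separated label of the host name (naive TLD extraction)."""
    return urlsplit(url).hostname.rsplit(".", 1)[-1]


sample = [
    "http://www.aluroba.com/tags/%C3%CE%CA%C7%D1%E5%C7.htm",
    "https://1kr.ua/newslist.html?tag=%E4%EE%F8%EA%EE%EB%FC%ED%EE%E5",
    "http://example.com/%C3%80",  # valid UTF-8, should not be flagged
]

flagged = [u for u in sample if has_non_utf8_escapes(u)]
by_tld = Counter(tld_of(u) for u in flagged)
```

A byte sequence in a legacy encoding can occasionally also be valid UTF-8, so a check like this can undercount.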

@tfmorris
Contributor

tfmorris commented Dec 6, 2024

> it may indeed pragmatically be better to follow the Python and JavaScript implementations.

That has the added advantage of being correct and conforming to the spec.

I think a bigger question is how to phase it in with the least impact on the ecosystem. The CDX spec doesn't include a version number or any information on the writer of the CDX file, making it difficult for readers to know how to interpret any given file.

@ato
Member

ato commented Dec 9, 2024

I'm going to wait a few more days for comments, and if no objections are raised I will merge this.

@ato ato merged commit 3907d24 into iipc:master Dec 14, 2024
5 checks passed

3 participants