Does not handle non-UTF8-encodable pathnames #20

Open
BartMassey opened this issue May 18, 2020 · 6 comments

Comments

@BartMassey

Pathnames containing non-UTF8 characters are not indexed, but instead produce a warning during indexing.

I am investigating a fix, but it looks quite difficult, which is probably why it has not been done previously.

@ngirard
Owner

ngirard commented May 18, 2020

Thanks for reporting!

Lolcate could gain the ability to deal with non-UTF8 pathnames, just as fd did a while ago.

However, I'd be more inclined to report non-UTF8 path names and dangling symlinks back to the user for further investigation or treatment instead of trying to index them, since they shouldn't be left unresolved in the first place.

Something like

Found 18 non-UTF8 path names. See /tmp/lolcate.xxx1 for details.
Found 2 dangling symlinks. See /tmp/lolcate.xxx2 for details.

What do you think?
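
A minimal sketch of the reporting approach proposed above, assuming the indexer already has a list of paths in hand. The names `PathReport`, `classify_paths`, and `summary_line` are hypothetical and not part of Lolcate; the only standard-library behavior relied on is that `Path::to_str` returns `None` for a path that is not valid Unicode.

```rust
use std::path::{Path, PathBuf};

/// Outcome of checking a batch of paths (hypothetical type, not part of Lolcate).
#[derive(Debug, Default)]
pub struct PathReport {
    /// Paths that are valid UTF-8 and can be indexed as plain strings.
    pub indexable: Vec<String>,
    /// Paths that are not valid UTF-8, kept losslessly for the report.
    pub non_utf8: Vec<PathBuf>,
}

/// Split paths into UTF-8 and non-UTF-8 groups using `Path::to_str`,
/// which returns `None` when the path is not valid Unicode.
pub fn classify_paths<I, P>(paths: I) -> PathReport
where
    I: IntoIterator<Item = P>,
    P: AsRef<Path>,
{
    let mut report = PathReport::default();
    for path in paths {
        let path = path.as_ref();
        match path.to_str() {
            Some(s) => report.indexable.push(s.to_owned()),
            None => report.non_utf8.push(path.to_path_buf()),
        }
    }
    report
}

/// Build a one-line summary in the style suggested above.
/// `details` is a caller-supplied location for the full list (an assumption).
pub fn summary_line(report: &PathReport, details: &Path) -> String {
    format!(
        "Found {} non-UTF8 path names. See {} for details.",
        report.non_utf8.len(),
        details.display()
    )
}
```

Keeping the rejected paths as `PathBuf` rather than converting them lossily means the details file can still record the exact bytes of each name.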

@BartMassey
Author

I have files on my box with ISO-8859-1 names that are older than the Unicode standard. I also have files with names produced by disk errors. There's no reasonable way to "fix" these files: they just need to be indexed.

The main issue, which I haven't looked into yet, is what regex does with non-utf8 strings, and whether it can make sense for this use case.
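
To illustrate the problem, here is a small Unix-only sketch using only the standard library. It shows that an ISO-8859-1 name has no `&str` form and that a lossy conversion discards the original byte, while the raw bytes remain searchable. The helper `name_contains` is hypothetical, and its plain byte-wise substring search is only a stand-in for a real pattern matcher; the `regex` crate does provide a `regex::bytes` module that matches on `&[u8]`, but whether that fits this use case is the open question.

```rust
use std::ffi::OsStr;
#[cfg(unix)]
use std::os::unix::ffi::OsStrExt;

/// Return true if `needle` occurs anywhere in the raw bytes of `name`.
/// This is a plain substring search over bytes, not a regex match.
#[cfg(unix)]
pub fn name_contains(name: &OsStr, needle: &[u8]) -> bool {
    // `slice::windows` panics on a zero-length window, so handle that first.
    if needle.is_empty() {
        return true;
    }
    name.as_bytes()
        .windows(needle.len())
        .any(|window| window == needle)
}
```

For a name such as the bytes `caf\xE9.txt` (ISO-8859-1 for "café.txt"), `to_str()` returns `None` and `to_string_lossy()` substitutes U+FFFD for the `0xE9` byte, yet `name_contains` can still find both `caf` and `\xE9.txt`.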

@ngirard
Owner

ngirard commented Nov 23, 2020

Incidentally, I stumbled upon this gist from @ssokolow.

Unfortunately, the code doesn't seem to be Windows-compatible.

@ssokolow

ssokolow commented Nov 23, 2020

Only because I don't have a Windows machine and have so much else to do that I didn't have time to set up a modern.ie testing VM to make sure I was implementing the same transformation that ntfs-3g does for unpaired surrogates.

Poke me around Christmas when my brother is visiting and I'll plug a USB stick into his PC to generate the requisite test files and test the resulting code.

If someone else wants to implement it more quickly, you need to use cfg to switch between use std::os::unix::ffi::{OsStrExt, OsStringExt}; and use std::os::windows::ffi::{OsStrExt, OsStringExt}; and to provide an alternative to as_bytes and from_vec using encode_wide and from_wide.

It's just the "What does Linux see when ntfs-3g encounters a filename from Windows containing un-paired surrogates? ...and how should I encode the data to ensure the transformation round-trips between a Linux build and a Windows build of the code?" that I'm blocked on.
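
For anyone picking this up, the cfg switch described above might look roughly like the sketch below. The module name `platform` and the functions `os_to_raw` and `raw_to_os` are hypothetical and are not taken from the gist. Storing the Windows `u16` units as little-endian byte pairs is an arbitrary choice made for illustration. It only round-trips within a single platform and does not answer the ntfs-3g question, so a Linux build and a Windows build of this code would not agree on the encoding of the same file name.

```rust
use std::ffi::{OsStr, OsString};

#[cfg(unix)]
mod platform {
    use std::ffi::{OsStr, OsString};
    use std::os::unix::ffi::{OsStrExt, OsStringExt};

    /// On Unix an OS string is already arbitrary bytes.
    pub fn to_raw(s: &OsStr) -> Vec<u8> {
        s.as_bytes().to_vec()
    }

    pub fn from_raw(raw: Vec<u8>) -> OsString {
        OsString::from_vec(raw)
    }
}

#[cfg(windows)]
mod platform {
    use std::ffi::{OsStr, OsString};
    use std::os::windows::ffi::{OsStrExt, OsStringExt};

    /// On Windows an OS string is a sequence of u16 units that may contain
    /// unpaired surrogates; store each unit as two little-endian bytes
    /// (an assumption of this sketch).
    pub fn to_raw(s: &OsStr) -> Vec<u8> {
        s.encode_wide().flat_map(|unit| unit.to_le_bytes()).collect()
    }

    /// Rebuild the u16 units; a trailing odd byte, if any, is ignored.
    pub fn from_raw(raw: Vec<u8>) -> OsString {
        let wide: Vec<u16> = raw
            .chunks_exact(2)
            .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
            .collect();
        OsString::from_wide(&wide)
    }
}

/// Lossless conversion from an OS string to a byte buffer.
pub fn os_to_raw(s: &OsStr) -> Vec<u8> {
    platform::to_raw(s)
}

/// Inverse of `os_to_raw` on the same platform.
pub fn raw_to_os(raw: Vec<u8>) -> OsString {
    platform::from_raw(raw)
}
```

Hiding the platform split behind two small functions keeps the rest of the code free of `cfg` attributes, so the encoding can be changed later once the cross-platform behavior is known.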

@ngirard
Owner

ngirard commented Nov 23, 2020

Thank you very much, @ssokolow!
I don't have any Windows system at my disposal, so I'm very likely to take you up on your invitation to poke you around Christmas!

@ngirard
Owner

ngirard commented Nov 23, 2020

For reference, here is a relevant discussion on r/rust.
