Fix custom encodings in filenames #15
For zip archivers (and some others), one can use |
or change the current system locale: LANG=en_US.IBM866 && 7z l test.zip |
https://sourceforge.net/p/p7zip/discussion/383044/thread/3d213124/
7z l -mcp=866 test.zip
7z l -mcp=1252 test.zip
No effect. |
What exactly is the problem? I have 10 failing tests in WSL. Are these failing tests related to this issue? |
I'm pasting the output of the failed tests; they seem to be related to encoding. |
I just saw this issue and wondered whether a PHP class I wrote might help here. It is intended to automatically detect and convert the encoding of data in cases such as this one (with moderate success).

I did some brief testing (about five minutes) but ran into a problem: the sample filename provided in the SourceForge discussion conforms with several different encodings (at least ~26, assuming I haven't made any mistakes in the class itself), so I don't think it would be possible, or at least not easy, to conclusively pick one encoding over all the others. So it may not be helpful here, but I thought I should share the results anyway, in case they inspire ideas that eventually lead to a solution.

<?php
// Note: The sample's extension (".doc") intentionally omitted here, for easier processing, simpler testing, etc.
$Sample = 'Ž›žåœª ©¬£§¢ã¨à ©žª ˜å«ž©žª_™';
$Demojibakefier = new \Maikuolan\Common\Demojibakefier();
foreach ($Demojibakefier->supported() as $CharSet) {
echo $Demojibakefier->checkConformity($Sample, $CharSet) ? 'Conforms with ' . $CharSet . ".\n" : 'Does not conform with ' . $CharSet . ".\n";
}

Produces:
If we wanted to get clever, it might be possible to "guess" which encoding is used: decode the bytes under each candidate encoding and compare the resulting characters against frequency tables for how often specific characters occur in different languages. But that's a lot of work, is prone to false positives (e.g., if someone uses random characters or weird filenames), and is well outside the scope of responsibility for Archive7z, so I would not recommend it.

To solve this effectively, the information probably needs to come from somewhere else (e.g., the implementation, the OS, etc.). Maybe just add a public property somewhere in Archive7z that lets the implementation specify the preferred encoding, and work from that (falling back either to UTF-8 or to some kind of "best guess" if the implementation fails to populate the property)? That way, the problem becomes the responsibility of the implementation, not Archive7z. |
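To make the "preferred encoding" idea concrete, here is a minimal sketch of how it could look. The class name `FilenameDecoder`, the property `preferredEncoding`, and the method `toUtf8()` are all hypothetical and are not part of Archive7z; the sketch also assumes the mbstring extension is available and takes the "fall back to UTF-8" branch rather than any kind of guessing.

```php
<?php
// Hypothetical sketch: none of these names exist in Archive7z.
final class FilenameDecoder
{
    /** Encoding supplied by the calling application, or null if unknown. */
    public ?string $preferredEncoding = null;

    /** Return the raw filename bytes converted to UTF-8. */
    public function toUtf8(string $rawName): string
    {
        if ($this->preferredEncoding !== null) {
            // The implementation knows where its archives come from,
            // so trust the encoding it supplied (e.g. 'CP866').
            return mb_convert_encoding($rawName, 'UTF-8', $this->preferredEncoding);
        }

        // Fallback: treat the name as UTF-8. Converting UTF-8 to UTF-8
        // leaves valid names unchanged and substitutes invalid bytes.
        return mb_convert_encoding($rawName, 'UTF-8', 'UTF-8');
    }
}

// Usage: the bytes below are an illustrative CP866-encoded name.
$decoder = new FilenameDecoder();
$decoder->preferredEncoding = 'CP866';
echo $decoder->toUtf8("\x8F\xE0\xA8\xA2\xA5\xE2"), "\n";
```

With this shape, choosing the right code page stays with the caller, and the library only has to perform one conversion.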
There is a patch for p7zip: https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch
if (!isUtf8 && ((hostOS == NFileHeader::NHostOS::kFAT) || (hostOS == NFileHeader::NHostOS::kNTFS))) {
As far as I can see, the patch is based on the "is UTF-8" information (which is trivial) and a check |
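The condition quoted from the patch can be restated in PHP as a rough sketch. The function name `needsOemDecoding()` and the string labels for the host OS are assumptions made for illustration; the real patch compares numeric host-OS identifiers from the ZIP header. The only fact relied on here is that bit 11 (0x0800) of the ZIP general-purpose flags marks a UTF-8 filename.

```php
<?php
// Illustrative labels only: the patch itself uses the numeric constants
// NFileHeader::NHostOS::kFAT and NFileHeader::NHostOS::kNTFS.
const HOST_FAT = 'FAT';
const HOST_NTFS = 'NTFS';

// Bit 11 of the ZIP general-purpose bit flag: filename is UTF-8.
const ZIP_FLAG_UTF8 = 0x0800;

/**
 * Decide whether a filename should be decoded from an OEM code page:
 * true only when the entry is not flagged as UTF-8 and was created
 * on a FAT or NTFS host.
 */
function needsOemDecoding(int $generalPurposeFlags, string $hostOs): bool
{
    $isUtf8 = ($generalPurposeFlags & ZIP_FLAG_UTF8) !== 0;

    return !$isUtf8 && ($hostOs === HOST_FAT || $hostOs === HOST_NTFS);
}
```

This only answers whether a conversion is needed; which OEM code page to convert from still has to come from elsewhere.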
It is possible to check the encoding via mb_detect_encoding, or at least to check for UTF-8.
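A small sketch of that check, assuming the mbstring extension is loaded. The helper names `looksLikeUtf8()` and `guessEncoding()` are hypothetical. As noted earlier in the thread, detection among single-byte encodings is inherently ambiguous, so the UTF-8 check is the only part that gives a reliable answer.

```php
<?php
/** True if the raw filename bytes form valid UTF-8. */
function looksLikeUtf8(string $name): bool
{
    return mb_check_encoding($name, 'UTF-8');
}

/**
 * Return the first plausible encoding from $candidates, or null if
 * none of them fits. Strict mode rejects byte sequences that are
 * invalid in a candidate encoding.
 */
function guessEncoding(string $name, array $candidates): ?string
{
    $found = mb_detect_encoding($name, $candidates, true);

    return $found === false ? null : $found;
}

// Usage: decide whether a conversion step is needed at all.
$raw = "\x8F\xE0\xA8\xA2\xA5\xE2"; // illustrative non-UTF-8 bytes
if (!looksLikeUtf8($raw)) {
    echo "Not UTF-8, needs an explicit source encoding.\n";
}
```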