Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select OEM/ANSI code page according to system locale setting #36

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

unxed
Copy link

@unxed unxed commented Jun 7, 2024

@LupinThidr
Copy link

LupinThidr commented Jun 22, 2024

Hello, thank you for this. I've seen your progress across multiple issue trackers regarding Linux and 7zip path encoding.
I ran across your comments while trying to diagnose 7zip unable to decode Shift-JIS (CP932) encoded paths in LZH files
I had thought that this patch would apply to those, but it seems ZipItem is specific to the .zip handler.
Other software such as unar don't fully support the .LZH spec as 7zip does.
I think your iconv conversion should be implemented as a separate function in StringConvert, so it can be easily used with other classes.
I was able to make LzhHandler use it rather the standard MultiByteToUnicodeString, but I don't completely understand the 7zip codebase, so I've hardcoded it to CP932 rather than the a mcp argument.

void MultiByteToUnicodeString3_iconv(UString &res, const AString &s)
{
  res.Empty();
  if (s.IsEmpty())
    return;

  iconv_t cd;
  if ((cd = iconv_open("UTF-8", "CP932")) != (iconv_t)-1) {

    AString sUtf8;

    unsigned slen = s.Len();
    char* src = s.Ptr_non_const();

    unsigned dlen = slen * 4 + 1; // (source length * 4) + null termination
    char* dst = sUtf8.GetBuf_SetEnd(dlen);
    const char* dstStart = dst;

    memset(dst, 0, dlen);

    size_t slen_size_t = static_cast<size_t>(slen);
    size_t dlen_size_t = static_cast<size_t>(dlen);
    size_t done = iconv(cd, &src, &slen_size_t, &dst, &dlen_size_t);

    if (done == (size_t)-1) {
      iconv_close(cd);

      // iconv failed. Falling back to default behavior
      MultiByteToUnicodeString2(res, s, 932);
      return;
    }

    // Null-terminate the result
    *dst = '\0';

    iconv_close(cd);

    AString sUtf8CorrectLength;
    size_t dstCorrectLength = dst - dstStart;
    sUtf8CorrectLength.SetFrom(sUtf8, static_cast<unsigned>(dstCorrectLength));
    if (ConvertUTF8ToUnicode(sUtf8CorrectLength, res) /*|| ignore_Utf8_Errors*/)
      return;
  }

}

and then in LzhHandler's GetProperty:

UString dst;
MultiByteToUnicodeString3_iconv(dst, item.GetName());

UString s = NItemName::WinPathToOsPath(dst);

If you'd like to try implementing it for LZH, here's a sample file and the current / expected output of 7zz l
https://archive.org/download/narcissu/na_sabun.lzh

2007-06-03 22:25:18 .....          537          392  na_sabun/▒C▒▒▒▒▒▒.txt
2007-06-03 22:25:18 .....          537          392  na_sabun/修正差分.txt

(I noticed the Debian package maintainer for 7zz is Japanese, so he may have some experience)

@unxed
Copy link
Author

unxed commented Sep 1, 2024

I don't completely understand the 7zip codebase

Unfortunately, I also don't understand the 7-zip codebase well enough. I'm afraid we need Igor Pavlov to tell us how to implement this.

@Neustradamus
Copy link

@ip7z: What do you think about this PR?

@kattjevfel
Copy link

kattjevfel commented Jan 14, 2025

This and the patch applied to the debian package completely wrecks shift_jis encoded filenames.
normal 7zip: �k�����s[mp4]
with patch: ûkÅ­Ä×ìs[mp4]
convmv'd: 北条時行[mp4]

Difference is that the first one can be fixed with convmv -f shift_jis -t utf8 -r <path>, but the second cannot, as has already improperly been converted to UTF-8. FWIW this always happened with p7zip, but was fixed with 7zip (and now re-broken, hooray)

@unxed
Copy link
Author

unxed commented Jan 14, 2025

Will look at this, thanks!

@unxed
Copy link
Author

unxed commented Jan 14, 2025

Can you please attach a sample file with such name?

@kattjevfel
Copy link

Absolutely, here you go.
test.zip

without patch: 20240323�����C�C����C������������C�h/���p�K���i�ŏ��ɂ��ǂ݉�����.]Read this first�j.TXT
with patch: 20240323é¿òùÿCâCââââCâââżê½ûéâüâCâh/ùÿùpïKû±üiì┼Åëé╔é¿ôÃé¦ë║é│éó.]Read this firstüj.TXT
convmv'd: 20240323お風呂イチャイチャ小悪魔メイド/利用規約(最初にお読み下さい.]Read this first).TXT

Also fwiw the current output with this patch is the same I get in Ark (KDE's archive GUI), no matter if I have an unpatched 7zip. But that's probably them just relying on how p7zip worked.

Unzipping with pure 7z or specifying encoding with unzip is currently the only way to get these filenames without corrupting them.

@unxed
Copy link
Author

unxed commented Jan 19, 2025

@kattjevfel please try to replace

    // Detect required code page name from current locale 
    char *lc = setlocale(LC_CTYPE, "");

    if (lc && lc[0]) {

to

    // Detect required code page name from current locale 
    char *lc = setlocale(LC_CTYPE, "");
    if (!lc || !lc[0]) {
      lc = getenv("LC_CTYPE");
    }

    if (lc && lc[0]) {

after applying this patch, then check you archive once again. Make sure your system locale is set to ja_JP as 7zip should somehow "guess" what legacy code page to use, so its using system locale for that.

@kattjevfel
Copy link

Your patch does work after adding #include <cstdlib>, though having to run it with LANG=ja_JP.UTF-8 every time I encounter a zip packaged on a Japanese system is kind of annoying.

At that point I still prefer having them not be converted to UTF-8 by 7-zip at all and instead have an external tool identify and convert them, as that doesn't even require the locale to be installed.

@unxed
Copy link
Author

unxed commented Jan 21, 2025

After change above it should also work without locale installed.

@kattjevfel
Copy link

You are right, when using LANG=ja_JP.UTF-8 without the locale installed, it behaves like without the patch entirely, which is at least better than the current state.

@unxed
Copy link
Author

unxed commented Jan 21, 2025

Use LC_CTYPE=ja_JP.UTF-8 to enable patch

kattjevfel added a commit to kattjevfel/scripts that referenced this pull request Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants