Skip to content

Commit

Permalink
Handle encodings better; make the sum type "public"
Browse files Browse the repository at this point in the history
Windows does not have a direct analogue of LANG=C or LC_ALL=C. Some
programs give them special treatment, but they do not affect the
way localized behavior of the Windows API operates. In particular,
the bash.exe WSL wrapper, as well as wsl.exe and wslconfig.exe, do
not produce their own localized messages (for errors not
originating in a running distribution) when they are set. Windows
also provides significant localization through localized versions
of Windows, so changing language settings in Windows, even
system-wide, does not always produce the same effect that many or
most Windows users who use a particular language would experience.

Various encodings may appear when bash.exe is WSL-related but gives
its own error message. Such a message is often in UTF-16LE, which
is what Windows uses internally, and preserves any localization.
That is the more well behaved scenario and already was detected;
this commit moves, but does not change, the code for that.

The situation where it is not UTF-16LE was previously handled by
treating it as UTF-8. Because the default strict error treatment
was used, this would error out in test discovery in some localized
setups, preventing all tests in test_index from running, including
the majority of them that are not related to hooks. This fixes that
by doing better detection that should decode the mesages correctly
most of the time, that should in practice decode them well enough
to tell (by the aka.ms URL) if the message is complaining about
there being no installed distribution all(?) of the time, and that
should avoid breaking unrelated tests even if that can't be done.

An English non-UTF-16LE message appears on GitHub Actions CI when
no distribution is installed. Testing of this situation on other
languages was performed in a virtual machine on a development
machine.

That the message is output in a narrow character set some of the
time when bash.exe produces it appears to be a limitation of the
bash.exe wrapper. In particular, with a zh-CN version of Windows
(and with the language not changed to anything else), a localized
message in Simplified Chinese was correctly printed when running
wsl.exe, but running bash.exe produced literal "?" characters in
place of Chinese characters (it was not a display or font issue,
and they were "?" and not Unicode replacement characters). The
change here does not overcome that; the literal "?" characters will
be included. But "https://aka.ms/wslstore" is still present if it
is an error about not having any distributions, so the correct
status is still inferred.

For more information on code pages in Windows, see:
https://serverfault.com/a/836221

The following alternatives to all or part of the approach taken
here were considered but, at least for now, not done, because they
would not clearly be simpler or more correct:

- Abandoning this "run bash.exe and see what it shows us" approach
  altogether and instead reimplementing the rules CreateProcessW
  uses, to find if the bash.exe the system finds is the one in
  System32, and then, if so, checking the metadata in that
  executable to determine if it's the WSL wrapper. I believe that
  would be even more complex to do correctly than it seems; the
  behavior noted in the WinBashStatus docstring and recent commit
  messages is not the whole story. The documented search order for
  CreateProcessW seems not to be followed in some circumstances.
  One is that the Windows Store version of Python seems always to
  forgo the usual System32 search that precedes seaching directories
  in PATH. It looks like there may also be some subtleties in which
  directories 32-bit builds search in.

- Using chardet. Although the chardet library is excellent, it is
  not clear that the code needed to bridge this highly specific use
  case to it would be simpler than the code currently in use. Some
  of the work might still need to be done by other means; when I
  tested it out for this, this did not detect the UTF-16LE messages
  as such for English. (They are often also valid UTF-8, because
  interleaving null characters is, while strange, permitted.)

- Also calling wsl.exe and/or wslconfig.exe. It's still necessary
  to call bash.exe, because it may not be the WSL bash, even on a
  system with WSL fully set up. Furthermore, those tools' output
  seem to vary in some complex ways, too. Using only one subprocess
  for the detection seemed the simplest. Even using "wsl --list"
  would introduce significant additional logic. Sometimes its
  output is a list of distributions, sometimes it is an error
  message, and if WSL is not set up it may be a help message.

- Using the Windows API to check for WSL systems.
  https://learn.microsoft.com/en-us/windows/win32/api/wslapi/ does
  not currently include functions to list registered distributions.

- Attempting to get wsl.exe to produce an English message using
  Windows API techniques like those used in Locale Emulator. This
  would be complicated, somewhat unintuitive and error prone to do
  in Python, and I am not sure how well it would work on a system
  that does not have an English language pack installed.

- Checking on disk for WSL distributions in the places they are
  most often expected to be. This would intertwine WinBashStatus
  with deep details of how WSL actually operates, and this seems
  like the sort of thing that is likely to change in the future.

However, there may be a more straightforward way to do this (that
is at least as correct and that remains transparent to debug).
Especially if the WinBashStatus class remains in test_index for
long (rather than just being used to aid in debugging existing test
falures and possible subsequent design decisions for making commit
hooks work more robustly on Windows in GitPython), then this may
be worth revisiting.

Thus it is *not* with the intention of treating WinBashStatus as a
"stable" part of the test suite that it is renamed from
_WinBashStatus. This is instead done because:

- Like HOOKS_SHEBANG, it makes sense to import it when figuring out
  how the tests work or debugging them.

- When debugging, it is intended that it be imported to call
  check() and examine the resulting `process` and `message`
  information, at least in the CheckError case.
  • Loading branch information
EliahKagan committed Nov 28, 2023
1 parent d779a75 commit 9ac2438
Showing 1 changed file with 54 additions and 16 deletions.
70 changes: 54 additions & 16 deletions test/test_index.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,10 +41,13 @@


@sumtype
class _WinBashStatus:
class WinBashStatus:
"""Status of bash.exe for native Windows. Affects which commit hook tests can pass.
Call :meth:`check` to check the status.
The :class:`CheckError` and :class:`WinError` cases should not typically be used in
``skip`` or ``xfail`` mark conditions, because they represent unexpected situations.
"""

Inapplicable = constructor()
Expand Down Expand Up @@ -84,10 +87,10 @@ def check(cls):
On Windows, `Popen` calls ``CreateProcessW``, which checks some locations before
using the ``PATH`` environment variable. It is expected to try the ``System32``
directory, even if another directory containing the executable precedes it in
``PATH``. (Other differences are less relevant here.) When WSL is present, even
with no distributions, ``bash.exe`` usually exists in ``System32``, and `Popen`
finds it even if another ``bash.exe`` precedes it in ``PATH``, as on CI. If WSL
is absent, ``System32`` may still have ``bash.exe``, as Windows users and
``PATH``. (The other differences are less relevant here.) When WSL is present,
even with no distributions, ``bash.exe`` usually exists in ``System32``, and
`Popen` finds it even if another ``bash.exe`` precedes it in ``PATH``, as on CI.
If WSL is absent, ``System32`` may still have ``bash.exe``, as Windows users and
administrators occasionally put executables there in lieu of extending ``PATH``.
"""
if os.name != "nt":
Expand All @@ -105,9 +108,7 @@ def check(cls):
except OSError as error:
return cls.WinError(error)

# FIXME: When not UTF-16LE: try local ANSI code page, then fall back to UTF-8.
encoding = "utf-16le" if b"\r\0\n\0" in process.stdout else "utf-8"
text = process.stdout.decode(encoding).rstrip() # stdout includes WSL errors.
text = cls._decode(process.stdout).rstrip() # stdout includes WSL's own errors.

if process.returncode == 1 and re.search(r"\bhttps://aka.ms/wslstore\b", text):
return cls.WslNoDistro(process, text)
Expand All @@ -121,8 +122,45 @@ def check(cls):
log.error("Strange output checking WSL status: %s", text)
return cls.CheckError(process, text)

@staticmethod
def _decode(stdout):
"""Decode ``bash.exe`` output as best we can. (This is used only on Windows.)"""
# When bash.exe is the WSL wrapper but the output is from WSL itself rather than
# code running in a distribution, the output is often in UTF-16LE, which Windows
# uses internally. The UTF-16LE representation of a Windows-style line ending is
# rarely seen otherwise, so use it to detect this situation.
if b"\r\0\n\0" in stdout:
return stdout.decode("utf-16le")

import winreg

# At this point, the output is probably either empty or not UTF-16LE. It's often
# UTF-8 from inside a WSL distro or a non-WSL bash shell. But our test command
# only uses the ASCII subset, so it's safe to guess wrong for that command's
# output. Errors from inside a WSL distro or non-WSL bash.exe are arbitrary, but
# unlike WSL's own messages, go to stderr, not stdout. So we can try the system
# active code page first. (Although console programs usually use the OEM code
# page, the ACP seems more accurate here. For example, on en-US Windows set to
# fr-FR, the message, if not UTF-16LE, is windows-1252, same as the ACP, while
# the OEM code page on such a system defaults to 437, which can't decode it.)
hklm_path = R"SYSTEM\CurrentControlSet\Control\Nls\CodePage"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, hklm_path) as key:
value, _ = winreg.QueryValueEx(key, "ACP")
try:
return stdout.decode(f"cp{value}")
except UnicodeDecodeError:
pass
except LookupError as error:
log.warning("%s", str(error)) # Message already says "Unknown encoding:".

# Assume UTF-8. If we don't have valid UTF-8, substitute Unicode replacement
# characters. (For example, on zh-CN Windows set to fr-FR, error messages from
# WSL itself, if not UTF-16LE, are in windows-1252, even though the ACP and OEM
# code pages are 936; decoding as code page 936 or as UTF-8 both have errors.)
return stdout.decode("utf-8", errors="replace")


_win_bash_status = _WinBashStatus.check()
_win_bash_status = WinBashStatus.check()


def _make_hook(git_dir, name, content, make_exec=True):
Expand Down Expand Up @@ -994,7 +1032,7 @@ class Mocked:
self.assertEqual(rel, os.path.relpath(path, root))

@pytest.mark.xfail(
type(_win_bash_status) is _WinBashStatus.WslNoDistro,
type(_win_bash_status) is WinBashStatus.WslNoDistro,
reason="Currently uses the bash.exe for WSL even with no WSL distro installed",
raises=HookExecutionError,
)
Expand All @@ -1005,7 +1043,7 @@ def test_pre_commit_hook_success(self, rw_repo):
index.commit("This should not fail")

@pytest.mark.xfail(
type(_win_bash_status) is _WinBashStatus.WslNoDistro,
type(_win_bash_status) is WinBashStatus.WslNoDistro,
reason="Currently uses the bash.exe for WSL even with no WSL distro installed",
raises=AssertionError,
)
Expand All @@ -1016,7 +1054,7 @@ def test_pre_commit_hook_fail(self, rw_repo):
try:
index.commit("This should fail")
except HookExecutionError as err:
if type(_win_bash_status) is _WinBashStatus.Absent:
if type(_win_bash_status) is WinBashStatus.Absent:
self.assertIsInstance(err.status, OSError)
self.assertEqual(err.command, [hp])
self.assertEqual(err.stdout, "")
Expand All @@ -1032,12 +1070,12 @@ def test_pre_commit_hook_fail(self, rw_repo):
raise AssertionError("Should have caught a HookExecutionError")

@pytest.mark.xfail(
type(_win_bash_status) in {_WinBashStatus.Absent, _WinBashStatus.Wsl},
type(_win_bash_status) in {WinBashStatus.Absent, WinBashStatus.Wsl},
reason="Specifically seems to fail on WSL bash (in spite of #1399)",
raises=AssertionError,
)
@pytest.mark.xfail(
type(_win_bash_status) is _WinBashStatus.WslNoDistro,
type(_win_bash_status) is WinBashStatus.WslNoDistro,
reason="Currently uses the bash.exe for WSL even with no WSL distro installed",
raises=HookExecutionError,
)
Expand All @@ -1055,7 +1093,7 @@ def test_commit_msg_hook_success(self, rw_repo):
self.assertEqual(new_commit.message, "{} {}".format(commit_message, from_hook_message))

@pytest.mark.xfail(
type(_win_bash_status) is _WinBashStatus.WslNoDistro,
type(_win_bash_status) is WinBashStatus.WslNoDistro,
reason="Currently uses the bash.exe for WSL even with no WSL distro installed",
raises=AssertionError,
)
Expand All @@ -1066,7 +1104,7 @@ def test_commit_msg_hook_fail(self, rw_repo):
try:
index.commit("This should fail")
except HookExecutionError as err:
if type(_win_bash_status) is _WinBashStatus.Absent:
if type(_win_bash_status) is WinBashStatus.Absent:
self.assertIsInstance(err.status, OSError)
self.assertEqual(err.command, [hp])
self.assertEqual(err.stdout, "")
Expand Down

0 comments on commit 9ac2438

Please sign in to comment.