Encoding problems #12

Open
d-01 opened this issue Dec 12, 2018 · 3 comments

d-01 commented Dec 12, 2018

I have encountered a problem with Cyrillic text encoding.
From Windows Explorer:

[image: files-list]

From powershell console:

PS> ls |% name

cyrillic_7_chars=русский.txt
text-1251.txt
text-utf8.txt

PS> gc text-1251.txt

русский

PS> gc text-utf8.txt

С?С?С?С?РєРёР№

From Jupyter Notebook:

PS> ls |% name

cyrillic_7_chars=■■■■txt
text-1251.txt
text-utf8.txt

PS> gc text-1251.txt

■■■■

PS> gc text-utf8.txt

русский

I have found a workaround, but I am not sure how to apply it to fix the problem:

PS> [Text.Encoding]::Default.GetString([Text.Encoding]::UTF8.GetBytes((ls |% name) -join "`n"))

cyrillic_7_chars=русский.txt
text-1251.txt
text-utf8.txt
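
One possible reading of why this workaround helps, sketched in Python. It rests on two assumptions that are not confirmed at this point in the thread: PowerShell writes to the pipe using the system ANSI code page (cp1251 here), and the kernel decodes that byte stream as UTF-8. The variable names are illustrative only.

```python
name = "русский"

# Assumption: PowerShell emits the file name in the ANSI code page (cp1251),
# and the kernel tries to decode those bytes as UTF-8.
raw_plain = name.encode("cp1251")
seen_plain = raw_plain.decode("utf-8", errors="replace")  # unreadable

# The workaround: UTF8.GetBytes(...) followed by Default.GetString(...)
# builds a deliberately "wrong" string inside PowerShell...
workaround_string = name.encode("utf-8").decode("cp1251")

# ...so that writing it out in cp1251 puts valid UTF-8 bytes on the pipe.
raw_workaround = workaround_string.encode("cp1251")
seen_workaround = raw_workaround.decode("utf-8")

print(seen_plain == name)       # False: the kernel shows replacement characters
print(seen_workaround == name)  # True: the kernel shows the original name
```

If that reading is right, the workaround pre-compensates for a decoding mismatch rather than fixing it, so the durable fix would belong on the decoding side.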

Environment information:

PS> [System.Text.Encoding]::Default

IsSingleByte      : True
BodyName          : koi8-r
EncodingName      : Cyrillic (Windows)
HeaderName        : windows-1251
...

PS> $psversiontable

Name                           Value                                           
----                           -----                                           
PSVersion                      5.1.14409.1005                                  
PSEdition                      Desktop                                         
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}                         
BuildVersion                   10.0.14409.1005                                 
CLRVersion                     4.0.30319.42000                                 
WSManStackVersion              3.0                                             
PSRemotingProtocolVersion      2.3                                             
SerializationVersion           1.1.0.1 

The version of the notebook server is: 5.6.0
The server is running on this version of Python: Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Kernel info:

Name: powershell-kernel
Version: 0.0.8
Home-page: https://github.com/vors/jupyter-powershell
Author: Sergei Vorobev
Author-email: [email protected]

What else I've tried so far:

  1. Changing $OutputEncoding global variable
  2. Changing [console]::OutputEncoding
  3. Changing [console]::InputEncoding
  4. chcp 866 – has no effect on cmd /c dir or Get-ChildItem / ls output
  5. chcp 65001 – fixes cmd /c dir but not Get-ChildItem / ls output
  6. Different browsers: Firefox, Chrome, IE11

Standard kernel (IPython 6.5.0) works fine:
In:

import os
os.listdir()

Out:

['cyrillic_7_chars=русский.txt', 'text-1251.txt', 'text-utf8.txt']

From powershell console:

PS> [text.encoding]::Default.getbytes('русский') | format-hex

00000000   F0 F3 F1 F1 EA E8 E9                             ðóññêèé

PS> [text.encoding]::utf8.getbytes('русский') | format-hex

00000000   D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9        ����кий

From Jupyter Notebook:

PS> [text.encoding]::Default.getbytes('русский') | format-hex

00000000   D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9        N?N?N?N???????  

PS> [text.encoding]::utf8.getbytes('русский') | format-hex

00000000   D0 A1 D0 82 D0 A1 D1 93 D0 A1 D0 83 D0 A1 D0 83  ??????N?????????
00000010   D0 A0 D1 94 D0 A0 D1 91 D0 A0 E2 84 96           ?■N??■N??■a??   
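
The four hex dumps above can be reproduced in Python, which suggests what happens to input typed in the notebook. This is a sketch under one assumption: the notebook sends the literal as UTF-8 and PowerShell reads those bytes as cp1251. The `hexdump` helper is hypothetical and only mimics the byte column of `format-hex`.

```python
text = "русский"


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# PowerShell console: the string is what was typed.
console_default = text.encode("cp1251")
console_utf8 = text.encode("utf-8")

# Notebook (assumption): the UTF-8 bytes of the literal are read as cp1251,
# so PowerShell holds a different, 14-character string.
string_in_powershell = console_utf8.decode("cp1251")
notebook_default = string_in_powershell.encode("cp1251")
notebook_utf8 = string_in_powershell.encode("utf-8")

print(hexdump(console_default))   # F0 F3 F1 F1 EA E8 E9
print(hexdump(console_utf8))      # D1 80 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
print(hexdump(notebook_default))  # same bytes as console_utf8
print(hexdump(notebook_utf8))     # D0 A1 D0 82 D0 A1 D1 93 ... D0 A0 E2 84 96
```

The notebook's `Default` dump equals the console's UTF-8 dump, and the notebook's UTF-8 dump is that same text encoded a second time, which is consistent with the stated assumption.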

vors (Owner) commented Dec 16, 2018

Thank you for the detailed report!
My uneducated guess would be that our Python repl_process abstraction expects UTF-8 but PowerShell uses UTF-16 by default, or perhaps that we do incremental decoding incorrectly in the kernel. The kernel itself is relatively small, so I think you should have no trouble debugging it by changing the kernel code. I will not have time to do it any time soon myself, but I'm happy to help you navigate the code and to review any changes.
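
A small Python sketch of the two possibilities mentioned above, using the standard `codecs` incremental decoder. The byte values are cp1251 and UTF-8 encodings of the test string from this issue; the chunk boundaries are made up for illustration and are not taken from the kernel.

```python
from codecs import getincrementaldecoder

# Case 1: the decoder's encoding does not match what the subprocess writes.
cp1251_output = "русский".encode("cp1251")
utf8_decoder = getincrementaldecoder("utf8")()
try:
    utf8_decoder.decode(cp1251_output)
    utf8_failed = False
except UnicodeDecodeError:
    # cp1251 bytes are not a valid UTF-8 sequence
    utf8_failed = True

# Case 2: incremental decoding itself copes with a multi-byte character
# that is split across two reads, as long as the encoding is right.
decoder = getincrementaldecoder("utf8")()
chunks = [b"\xd1", b"\x80\xd1", b"\x83"]  # UTF-8 for "ру", split mid-character
pieces = [decoder.decode(chunk) for chunk in chunks]

print(utf8_failed)  # True
print(pieces)       # ['', 'р', 'у']
```

In this sketch the incremental decoder handles split sequences correctly, so an encoding mismatch looks like the more likely of the two causes, though that is only a guess without debugging the kernel.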

d-01 (Author) commented May 29, 2019

Problem solved:

--- a/subprocess_repl.py.orig
+++ b/subprocess_repl.py
@@ -9,10 +9,16 @@ import os
 import sys
 import re
 import signal
+import locale
 from subprocess import Popen
 from codecs import getencoder, getincrementaldecoder

 PY3 = sys.version_info[0] == 3
+# On Windows encoding expected to be something like 'cp1252' (en) or 'cp1251' (ru)
+# depending on system-wide "System locale" setting.
+# Path to setting: Region and Language -> Administrative (tab) ->
+# -> Language for non-Unicode programs -> Change system locale...
+ENCODING = locale.getpreferredencoding()

 if os.name == 'posix':
     POSIX = True
@@ -23,8 +29,8 @@ else:

 class SubprocessRepl(object):
     def __init__(self, cmd):
-        self.encoder = getencoder('utf8')
-        self.decoder = getincrementaldecoder('utf8')()
+        self.encoder = getencoder(ENCODING)
+        self.decoder = getincrementaldecoder(ENCODING)()
         self.popen = Popen(cmd, bufsize=1,
             stderr=subprocess.STDOUT, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
         if POSIX:
@@ -60,7 +66,7 @@ class SubprocessRepl(object):
         si.flush()

     def reset_decoder(self):
-        self.decoder = getincrementaldecoder('utf8')()
+        self.decoder = getincrementaldecoder(ENCODING)()

     def read(self):
         """Reads at least one decoded char of output"""

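The idea behind the patch can be tried in isolation with a minimal sketch like the one below. `make_codecs` is a hypothetical helper, not part of the kernel; it only mirrors how the patch builds the encoder and incremental decoder from `locale.getpreferredencoding()`. The encoding is pinned to `cp1251` in the example so the result does not depend on the machine's system locale.

```python
import locale
from codecs import getencoder, getincrementaldecoder


def make_codecs(encoding=None):
    """Return (encoder, incremental decoder) for the given encoding.

    Falls back to the locale's preferred encoding, as the patch does.
    """
    encoding = encoding or locale.getpreferredencoding()
    return getencoder(encoding), getincrementaldecoder(encoding)()


# Pinned to cp1251 here; the patch would pick this up on a Russian
# system locale and something like cp1252 on an English one.
encoder, decoder = make_codecs("cp1251")

# The encoder returns (bytes, number of characters consumed).
to_stdin, consumed = encoder("gc text-1251.txt\n")

# Bytes as PowerShell would write them in cp1251.
from_stdout = decoder.decode(b"\xf0\xf3\xf1\xf1\xea\xe8\xe9")

print(to_stdin)     # b'gc text-1251.txt\n'
print(from_stdout)  # русский
```

One trade-off of this design: the kernel follows the system locale, so it behaves differently across machines, and characters outside the ANSI code page cannot be represented.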
vors (Owner) commented Jun 15, 2019

Nice! @d-01 would you mind sending a pull request?
