[python 3.8] UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 45: ordinal not in range(128) #149

Open
csgroen opened this issue Mar 26, 2021 · 5 comments


@csgroen

csgroen commented Mar 26, 2021

Hello,

I've just made an updated conda environment for python 3.8 and I can't read loom files using anndata.read_loom() anymore. It gives me this error (see full traceback below):

Traceback (most recent call last):

  File "<ipython-input-2-b0b79aae2f29>", line 1, in <module>
    adata = anndata.read_loom('/home/clarice/Documents/SingleCell_PseudoTime/data/CHLA9.loom')

  File "/home/clarice/.local/lib/python3.8/site-packages/anndata/_io/read.py", line 225, in read_loom
    var = dict(lc.row_attrs)

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/attribute_manager.py", line 102, in __getitem__
    return self.__getattr__(thing)

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/attribute_manager.py", line 119, in __getattr__
    vals = loompy.materialize_attr_values(self.ds._file[a][name][:])

  File "/home/clarice/anaconda3/lib/python3.8/site-packages/loompy/normalize.py", line 98, in materialize_attr_values
    result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 45: ordinal not in range(128)

Of note: I can read the same file in my python3.7 environment, but it prints a message:

Variable names are not unique. To make them unique, call '.var_names_make_unique'.

It's always been like this. After running .var_names_make_unique, it all works out perfectly.

Any idea why the read fails with a UnicodeDecodeError? Is there anything I can do?

@SergejN

SergejN commented May 16, 2021

I have the same issue. Here is the list of packages that are currently installed.

Package        Version
-------------- -------
click          8.0.0
h5py           3.2.1
llvmlite       0.36.0
loompy         3.0.6
numba          0.53.1
numpy          1.20.3
numpy-groupies 0.9.13
pip            21.1.1
scipy          1.6.3
setuptools     56.2.0
wheel          0.36.2

Thanks for any ideas on how to get it to run.

@SergejN

SergejN commented May 17, 2021

OK, I think I got it. This should also take care of #141.
After some hours of debugging I realized that the file gencode.v31.metadata.tab, which I downloaded from https://storage.googleapis.com/linnarsson-lab-www-blobs/human_GRCh38_gencode.v31.tar.gz, contains non-ASCII characters:

[nowoshil@vieccews0302 human_GRCh38_gencode.v31.600]$ grep --color='auto' -P -n "[^\x00-\x7F]" gencode.v31.metadata.tab
33589:ENSG00000175634   ENSG00000175634.15      RPS6KB2 ribosomal protein S6 kinase B2  protein_coding  HGNC:10437      chr11   67428460        67435401   protein-coding gene     gene with protein product       11q13.2 11q13.2 "p70S6Kb|P70-BETA|STK14B|KLS|S6KB|S6Kbeta|S6Kβ" OTTHUMG00000167673uc001old.4       NM_003952       CCDS41677       Q9UBS0  "9878560|9804755"       MGI:1927343     RGD:1305144     RPS6KB2 608939          False
33759:ENSG00000110203   ENSG00000110203.9       FOLR3   folate receptor gamma   protein_coding  HGNC:3795       chr11   72114869        72139892  protein-coding gene      gene with protein product       11q13.4 11q13.4 "FR-G|FRγ"      OTTHUMG00000167870      uc031xur.2      NM_000804       CCDS73344  P41439  8110752                 FOLR3   602469          False
33764:ENSG00000110195   ENSG00000110195.13      FOLR1   folate receptor alpha   protein_coding  HGNC:3791       chr11   72189558        72196323  protein-coding gene      gene with protein product       11q13.4 11q13.4 FRα     OTTHUMG00000167876      uc001osa.3      NM_016725       CCDS8211  P15328   1717147 MGI:95568       RGD:71032       FOLR1   136430          False
33765:ENSG00000165457   ENSG00000165457.14      FOLR2   folate receptor beta    protein_coding  HGNC:3793       chr11   72216601        72221950  protein-coding gene      gene with protein product       11q13.4 11q13.4 FRβ     OTTHUMG00000150394      uc001ose.5      NM_000803       CCDS8212  P14207   "7698003|8110752"       MGI:95569       RGD:1308515     FOLR2   136425          False
44873:ENSG00000166501   ENSG00000166501.14      PRKCB   protein kinase C beta   protein_coding  HGNC:9395       chr16   23835983        24220611  protein-coding gene      gene with protein product       16p12.2-p12.1   16p12.2-p12.1   PKCβ    OTTHUMG00000131615      uc002dmd.4      NM_212535 "CCDS10618|CCDS10619"    P05771  3658678 MGI:97596       RGD:3396        PRKCB   176970          False
49067:ENSG00000154229   ENSG00000154229.12      PRKCA   protein kinase C alpha  protein_coding  HGNC:9393       chr17   66302613        66810743  protein-coding gene      gene with protein product       17q24.2 17q24.2 PKCα    OTTHUMG00000179533      uc002jfp.2      NM_002737       CCDS11664 P17252           MGI:97595       RGD:3395        PRKCA   176960          False
52643:ENSG00000105221   ENSG00000105221.17      AKT2    AKT serine/threonine kinase 2   protein_coding  HGNC:392        chr19   40230317        40285536   protein-coding gene     gene with protein product       19q13.2 19q13.2 PKBβ    OTTHUMG00000137375      uc002onf.3      NM_001626       "CCDS12552|CCDS82350"      P31751  1409633 MGI:104874      RGD:2082        AKT2    164731          False
53513:ENSG00000126583   ENSG00000126583.11      PRKCG   protein kinase C gamma  protein_coding  HGNC:9402       chr19   53879190        53907652  protein-coding gene      gene with protein product       19q13.42        19q13.42        "PKCC|MGC57564|PKCγ"    OTTHUMG00000064846      uc002qcq.2NM_002739        CCDS12867       P05129  "8432525|3755548"       MGI:97597       RGD:3397        PRKCG   176980          False
58592:ENSG00000089289   ENSG00000089289.16      IGBP1   immunoglobulin binding protein 1        protein_coding  HGNC:5461       chrX    70133447  70166324 protein-coding gene     gene with protein product       Xq13.1  Xq13.1  α4      OTTHUMG00000021767      uc004dxv.4      NM_001370192    CCDS14396  P78318  9441740 MGI:1346500     RGD:62011       IGBP1   300139          False
59609:ENSG00000129675   ENSG00000129675.16      ARHGEF6 Rac/Cdc42 guanine nucleotide exchange factor 6  protein_coding  HGNC:685        chrX    136665547  136780932       protein-coding gene     gene with protein product       Xq26.3  Xq26.3  "alphaPIX|Cool-2|KIAA0006|alpha-PIX|Cool2|αPix" OTTHUMG00000022518 uc004fab.5      NM_004840       "CCDS14660|CCDS78509"   Q15052  "7584048|9659915"       MGI:1920591     RGD:1359674     ARHGEF6 300267             False
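
For anyone without GNU grep at hand, the same check can be sketched in Python. This is only an illustration: the helper name non_ascii_lines and the sample rows are made up here and are not part of loompy or of the metadata file.

```python
from typing import Iterable, Iterator, Tuple


def non_ascii_lines(lines: Iterable[bytes]) -> Iterator[Tuple[int, str]]:
    """Yield (line number, text) for every line that contains a non-ASCII byte."""
    for number, raw in enumerate(lines, start=1):
        if not raw.isascii():
            # Decode only for display; undecodable bytes are replaced, not fatal.
            yield number, raw.decode("utf-8", errors="replace").rstrip("\n")


# Hypothetical sample rows standing in for the real metadata file.
sample = [b"FOLR1\tFR-alpha\n", "FOLR3\tFR\u03b3\n".encode("utf-8")]
hits = list(non_ascii_lines(sample))
print(hits)
```

On a real file, passing a handle opened in binary mode (open(path, "rb")) to the helper should list the same lines that the grep call reports.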

I played around with the locale settings of my Docker container, but that didn't help much. I ended up patching the file normalize.py as follows:

--- /usr/local/lib/python3.9/site-packages/loompy/normalize.py  2021-05-17 13:00:47.120228000 +0200
+++ /usr/local/lib/python3.9/site-packages/loompy/normalize.py  2021-05-17 13:00:47.120228000 +0200
@@ -95,7 +95,10 @@
                else:
                        temp = a
                # Then unescape XML entities and convert to unicode
-               result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+               try:
+                       result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+               except:
+                       result = np.array([html.unescape(x.decode("utf-8")) for x in temp], dtype=object)
        elif np.issubdtype(a.dtype, np.str_) or np.issubdtype(a.dtype, np.unicode_):
                result = np.array(a.astype(str), dtype=object)
        else:

I'm not sure how x.decode("utf-8") affects performance, so the fallback branch only runs when the original conversion raises, i.e. for attributes containing entries like the ones above that would otherwise trigger the UnicodeDecodeError.
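
A standalone sketch of the same idea, outside of loompy, may make the failure easier to reproduce. It assumes that NumPy decodes byte-string arrays as ASCII in astype(str), which is what the traceback in this issue suggests. The function name unescape_bytes_attr is hypothetical, and the fallback uses np.char.decode and catches only UnicodeDecodeError instead of the bare except in the patch.

```python
import html

import numpy as np


def unescape_bytes_attr(values: np.ndarray) -> np.ndarray:
    """Convert a byte-string array to unescaped Python strings."""
    try:
        # Fast path: succeeds only when every element is pure ASCII.
        decoded = values.astype(str)
    except UnicodeDecodeError:
        # Fallback: decode the whole array as UTF-8 (assumes UTF-8 input).
        decoded = np.char.decode(values, "utf-8")
    return np.array([html.unescape(x) for x in decoded], dtype=object)


# Made-up attribute values: one UTF-8 alias and one XML-escaped entry.
aliases = np.array(["FR\u03b1".encode("utf-8"), b"S6K&amp;beta"])
result = unescape_bytes_attr(aliases)
print(result.tolist())
```

Since ASCII is a subset of UTF-8, always decoding as UTF-8 would also work. The try/except form just keeps the original fast path for arrays that contain no non-ASCII bytes.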

@stela2502

Sad that this fix has not reached the official package yet.
I had to install from GitHub to get it :-(

pip install git+https://github.com/linnarsson-lab/loompy.git

@slinnarsson
Contributor

If you make a pull request I'm happy to accept it

@stela2502
Copy link

The fix is already in your git repository, but it is too new to be picked up by "pip install loompy", so I likely just need to wait for the next release. Until then, installing from git is sufficient to fix the problem. No pull request is necessary any more. But thank you!
