UnicodeDecodeError when read in loom #484

Closed
crazyhottommy opened this issue Dec 24, 2020 · 6 comments

@crazyhottommy

Hi,
When I read in the loom file, I get a UnicodeDecodeError:

adata = ad.read_loom("mydata.loom")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-903c3e692f43> in <module>
----> 1 adata = ad.read_loom("cd3_minus_B_cells_sub_5000cells.loom")

~/anaconda3/envs/stream/lib/python3.6/site-packages/anndata/readwrite/read.py in read_loom(filename, sparse, cleanup, X_name, obs_names, var_names, dtype, **kwargs)
    165             if key != '': layers[key] = lc.layers[key].sparse().T.tocsr() if sparse else lc.layers[key][()].T
    166 
--> 167         obs = dict(lc.col_attrs)
    168         if obs_names in obs.keys(): obs['obs_names'] = obs.pop(obs_names)
    169         obsm_attrs = [k for k, v in obs.items() if v.ndim > 1 and v.shape[1] > 1]

~/anaconda3/envs/stream/lib/python3.6/site-packages/loompy/attribute_manager.py in __getitem__(self, thing)
    100                                 return result
    101                 else:
--> 102                         return self.__getattr__(thing)
    103 
    104         def __getattr__(self, name: str) -> np.ndarray:

~/anaconda3/envs/stream/lib/python3.6/site-packages/loompy/attribute_manager.py in __getattr__(self, name)
    117                                 # Read values from the HDF5 file
    118                                 a = ["/row_attrs/", "/col_attrs/"][self.axis]
--> 119                                 vals = loompy.materialize_attr_values(self.ds._file[a][name][:])
    120                                 self.__dict__["storage"][name] = vals
    121                         return vals

~/anaconda3/envs/stream/lib/python3.6/site-packages/loompy/normalize.py in materialize_attr_values(a)
     96                         temp = a
     97                 # Then unescape XML entities and convert to unicode
---> 98                 result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
     99         elif np.issubdtype(a.dtype, np.str_) or np.issubdtype(a.dtype, np.unicode_):
    100                 result = np.array(a.astype(str), dtype=object)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)

It is a different subset of the Seurat object I converted to loom. I used the same code. The previous loom file can be read in without problems.

I googled and found https://stackoverflow.com/questions/10406135/unicodedecodeerror-ascii-codec-cant-decode-byte-0xd1-in-position-2-ordinal

How can I fix this?

Thanks!

@ivirshup
Member

Hi!

I'm not completely sure, though at first glance it seems like an issue in loompy. Can you share some more information about your environment?

Ideally something like:

import anndata
from sinfo import sinfo

sinfo(dependencies=True)

Would you be able to share this file, or another one which gives you the same issue?

@crazyhottommy
Author

Thanks! I had problems running sinfo:

>>> sinfo(dependencies=True)
Traceback (most recent call last):
  File "/homes6/mtang/anaconda3/envs/stream/lib/python3.7/site-packages/sinfo/main.py", line 195, in sinfo
    mod_version = _find_version(mod.__version__)
AttributeError: module 'importlib_metadata' has no attribute '__version__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/homes6/mtang/anaconda3/envs/stream/lib/python3.7/site-packages/sinfo/main.py", line 198, in sinfo
    mod_version = _find_version(mod.version)
  File "/homes6/mtang/anaconda3/envs/stream/lib/python3.7/site-packages/sinfo/main.py", line 42, in _find_version
    return mod_version_attr()
TypeError: version() missing 1 required positional argument: 'distribution_name'

You can test the loom file at https://drive.google.com/file/d/1qYAhnunhtCdFbQxU_2zIXim25BloXzFY/view?usp=sharing

@ivirshup
Member

> I had problems running sinfo

Hmm, didn't know this had issues. Maybe just your environment info then? (Also, is this a different environment? Your previous example was with python 3.6, but this one is 3.7.)
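
If sinfo fails, the same information can be gathered by hand. Below is a minimal sketch, assuming each package exposes a `__version__` attribute (most do, but that is an assumption). `collect_versions` is a hypothetical helper name, not part of any library.

```python
import importlib
import platform


def collect_versions(package_names):
    """Return {name: version string} for the interpreter and each named package."""
    versions = {"python": platform.python_version()}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is missing from this environment
            versions[name] = "not installed"
        else:
            # Assumption: the package exposes __version__; fall back to "unknown"
            versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions
```

Calling `collect_versions(["anndata", "loompy", "h5py", "numpy"])` in each environment (local and HPC) and pasting the resulting dictionary would probably be enough here.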

> You can test the loom file at

Thanks for linking that! It looks like I don't have permissions to access it at the moment, would you mind changing those?

@crazyhottommy
Author

Yes, this one is on the HPC and the previous one was on my local computer, but both gave me the same error.
I just changed the permission: https://drive.google.com/file/d/1qYAhnunhtCdFbQxU_2zIXim25BloXzFY/view?usp=sharing

Thanks for looking into it.

@ivirshup
Member

ivirshup commented Dec 29, 2020

It looks like this is an issue in loompy (linnarsson-lab/loompy#141). Somehow the column was written with Unicode (non-ASCII) values, while the file tells HDF5 that the values are ASCII ("naïve" seems to be the culprit).
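
To illustrate why the read fails, here is a minimal sketch in plain Python. The byte string is a hypothetical stand-in for one stored label, not the actual value from the file, and `try_decode` is a helper made up for this example.

```python
# UTF-8 bytes for "naïve": the "ï" is stored as the two bytes 0xC3 0xAF
raw = b"na\xc3\xafve"


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) if decoding fails."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, err


# Decoding as ASCII fails on byte 0xc3, the same byte named in the traceback
text, err = try_decode(raw, "ascii")
print(err)

# Decoding the same bytes as UTF-8 succeeds
text, err = try_decode(raw, "utf-8")
print(text)
```

The bytes on disk are valid UTF-8, so the data itself is intact; only the declared character set is wrong.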

If you're using h5py>=3, you can read the offending column manually like this (with f being the file opened via h5py.File): f["col_attrs"]["annotation"].asstr(encoding="utf-8")[:]

If you're using h5py<3, it's a bit more complicated, but something like this should work:

[x.decode() for x in f["col_attrs"]["annotation"]]
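
To find which columns and values are affected, something along these lines might help. It is a rough sketch, not tested against this particular file: `non_ascii_entries` and `scan_col_attrs` are hypothetical helper names, and the scan assumes the column attributes are array datasets stored directly under "col_attrs".

```python
def non_ascii_entries(values, encoding="utf-8"):
    """Return (index, decoded_text) for every bytes value containing a non-ASCII byte."""
    hits = []
    for i, raw in enumerate(values):
        if isinstance(raw, bytes) and any(b > 127 for b in raw):
            # errors="replace" keeps the scan going even if a value is not valid UTF-8
            hits.append((i, raw.decode(encoding, errors="replace")))
    return hits


def scan_col_attrs(path, group="col_attrs"):
    """Map each string attribute under `group` to its non-ASCII entries."""
    import h5py  # imported lazily so the helper above works without h5py

    report = {}
    with h5py.File(path, "r") as f:
        for name, ds in f[group].items():
            # Only look at datasets holding byte strings ("S") or object values ("O")
            if isinstance(ds, h5py.Dataset) and ds.dtype.kind in ("S", "O"):
                hits = non_ascii_entries(ds[:].ravel().tolist())
                if hits:
                    report[name] = hits
    return report
```

`scan_col_attrs("mydata.loom")` would then return a dictionary keyed by attribute name, which should make it easy to spot labels like "naïve".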

I'd say you should open an issue with whatever tool wrote this file, since it looks like the bug originated there.


I'm not sure what solutions there are beyond manually reading the values out of the file. If you copy the file but encode the strings as Unicode, loompy throws a ValueError (at least with h5py>=3; it might be different with h5py<3). Maybe you could filter out or clean the data with Unicode values?
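
If the labels pass through Python at some point before the file is written, cleaning could look roughly like this. It is a sketch using only the standard library, and `to_ascii` is a hypothetical helper name.

```python
import unicodedata


def to_ascii(label):
    """Strip accents from a label and drop anything that has no ASCII equivalent."""
    # NFKD splits accented characters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", label)
    # Encoding with "ignore" discards the combining marks and other non-ASCII characters
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("naïve B cell"))
```

Note that characters without an ASCII decomposition (Greek letters, for example) are dropped entirely, so it is worth checking that cleaned labels remain distinct.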

The error
import anndata as ad
import loompy
import h5py
from functools import partial

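def copy_elem(f, key, value):
    # Recreate groups, re-encode byte-string datasets as UTF-8 text, copy the rest unchanged
    if isinstance(value, h5py.Group):
        f.create_group(key)
    elif isinstance(value, h5py.Dataset) and value.dtype.char == "S":
        # Decode the stored bytes as UTF-8 and write them back as str values
        f[key] = value.asstr(encoding="utf-8")[:]
    else:
        f.create_dataset(key, data=value)

# Walk the original file and copy every group and dataset into a new file
with h5py.File("./test.loom", "r") as orig, h5py.File("./result.loom", "w") as result:
    orig.visititems(partial(copy_elem, result))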

result = ad.read_loom("./result.loom")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-8d3722a2a313> in <module>
----> 1 result = ad.read_loom("./result.loom")

~/github/anndata/anndata/_io/read.py in read_loom(filename, sparse, cleanup, X_name, obs_names, obsm_names, var_names, varm_names, dtype, **kwargs)
    192     from loompy import connect
    193 
--> 194     with connect(filename, "r", **kwargs) as lc:
    195         if X_name not in lc.layers.keys():
    196             X_name = ""

/usr/local/lib/python3.8/site-packages/loompy/loompy.py in connect(filename, mode, validate, spec_version)
   1387                 Note: if validation is requested, an exception is raised if validation fails.
   1388 	"""
-> 1389         return LoomConnection(filename, mode, validate=validate)

/usr/local/lib/python3.8/site-packages/loompy/loompy.py in __init__(self, filename, mode, validate)
     80                         lv = loompy.LoomValidator()
     81                         if not lv.validate(filename):
---> 82                                 raise ValueError("\n".join(lv.errors) + f"\n{filename} does not appead to be a valid Loom file according to Loom spec version '{lv.version}'")
     83 
     84                 self._file = h5py.File(filename, mode)

ValueError: Row attribute 'Gene' dtype object is not allowed
Column attribute 'CellID' dtype object is not allowed
Column attribute 'ClusterName' dtype object is not allowed
Column attribute 'RNA_snn_res_1_5' dtype object is not allowed
Column attribute 'annotation' dtype object is not allowed
Column attribute 'bms_subj_id' dtype object is not allowed
Column attribute 'bor_by_irrc_may_2018' dtype object is not allowed
Column attribute 'cd3_neg_cell_number' dtype object is not allowed
Column attribute 'cd3_plus_cell_number' dtype object is not allowed
Column attribute 'cd3_status' dtype object is not allowed
Column attribute 'cohort' dtype object is not allowed
Column attribute 'cohort2' dtype object is not allowed
Column attribute 'group' dtype object is not allowed
Column attribute 'index' dtype object is not allowed
Column attribute 'orig_ident' dtype object is not allowed
Column attribute 'pbmc_sample_id' dtype object is not allowed
Column attribute 'pool_id' dtype object is not allowed
Column attribute 'seurat_clusters' dtype object is not allowed
Column attribute 'singleR_cluster' dtype object is not allowed
Column attribute 'singleR_cluster_main' dtype object is not allowed
Column attribute 'subject_id' dtype object is not allowed
Column attribute 'tigl_id' dtype object is not allowed
Column attribute 'treatment_cycle' dtype object is not allowed
Column attribute 'type' dtype object is not allowed
For help, see http://linnarssonlab.org/loompy/format/
./result.loom does not appead to be a valid Loom file according to Loom spec version '0.0.0'

@crazyhottommy
Author

Thank you! Finding the offending characters ("naïve") helped a lot. I will fix that on my side.
