
unable to read hdf file in python #3

Open · gauravsinghrathore opened this issue Nov 2, 2022 · 7 comments
@gauravsinghrathore commented Nov 2, 2022

[Three screenshots attached showing the failed read attempts]
Dear Linnarsson lab,
Thank you very much for publishing this fantastic and precious dataset.
However, I am struggling to read the posted h5 file with both h5py and scanpy. It looks like the problem lies in the key of the h5 file (which I presume is 'shoji').

@slinnarsson (Contributor)

Under shoji there are many keys. Try this:

hdf["shoji"].keys()
hdf["shoji"]["CellID"][:10]
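If it helps, here is a minimal sketch for listing everything stored in the file (assuming the file is already open as hdf and that every entry under 'shoji' is a dataset):

for name, dset in hdf["shoji"].items():
    # print each dataset's name, shape, and dtype
    print(name, dset.shape, dset.dtype)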

@gauravsinghrathore (Author) commented Nov 2, 2022

Thanks for the swift reply. :) This works. Since this is an ultra-large dataset, I now see why loom is not helpful here.

I wrote a quick workaround to extract the data from the h5 file and build a scanpy-compatible AnnData. But I suggest having at least 1 TB of RAM.
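A quick way to sanity-check that requirement before loading anything (a minimal sketch; h5py exposes shape and dtype without reading the data):

import h5py
import numpy as np

with h5py.File('HumanFetalBrainPool.h5', 'r') as f:
    dset = f['shoji']['Expression']
    # dense in-memory size = rows x cols x bytes per element
    print(np.prod(dset.shape) * dset.dtype.itemsize / 1e9, 'GB')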

### load basic Python libraries ###

import numpy as np
import h5py
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

### read the hdf file using h5py ###

hdf = h5py.File('/scratch/kgr851/datasets/human_brain_development/linnarsson/HumanFetalBrainPool.h5','r')

### I think 'Expression' is the count matrix ###

### make adata, then add other metadata as adata.obs ###

expr = hdf['shoji']['Expression'][:1665937]
a = sc.AnnData(expr)

### extract cell names and gene names and add to adata ###

genes = np.array(hdf['shoji']['Gene'][:], dtype=str)
cells = np.array(hdf['shoji']['CellID'][:1665937], dtype=str)
a.obs_names = cells
a.var_names = genes

### read the columns you are interested in to add as obs ###
### (remember to keep it small and equal in length to the number of cells; you can also loop it) ###
a.obs['CellID'] = np.array(hdf['shoji']['CellID'][:1665937], dtype=str)
a.obs['CellClass'] = np.array(hdf['shoji']['CellClass'][:1665937], dtype=str)
a.obs['Chemistry'] = np.array(hdf['shoji']['Chemistry'][:1665937], dtype=str)
a.obs['Tissue'] = np.array(hdf['shoji']['Tissue'][:1665937], dtype=str)
a.obs['Subregion'] = np.array(hdf['shoji']['Subregion'][:1665937], dtype=str)
a.obs['Age'] = np.array(hdf['shoji']['Age'][:1665937], dtype=str)
a.obs.head(20)

### time to add the embedding ###
embedding = hdf['shoji']['Embedding'][:1665937]
a.obsm['X_embedding'] = embedding

with plt.rc_context({"figure.figsize": [10, 10]}):
    sc.pl.embedding(a, basis="X_embedding", color=["Age"], frameon=False, size=2)

### adata is ready; you can save it in h5ad format ###

a.write('/scratch/kgr851/datasets/human_brain_development/linnarsson/hbd_atlas.h5ad')

### you can also save it as a loom file ###
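For example, using AnnData's built-in loom writer (requires loompy; the filename is just a placeholder):

a.write_loom('hbd_atlas.loom')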

@bobermayer
Here's another way of reading the gene expression matrix chunk by chunk for a (random) subsample of cells, avoiding the creation of large dense arrays:

import sys
import numpy as np
import scanpy as sc
import h5py
import anndata
import scipy.sparse

hdf = h5py.File('HumanFetalBrainPool.h5', 'r')
mat = hdf['shoji']['Expression']

# pick roughly ncells random cells by thresholding uniform samples per row
ncells = 100000
cells = np.where(np.random.sample(mat.shape[0]) < ncells/mat.shape[0])[0]
tot_chunks = int(np.prod([np.ceil(mat.shape[k]/mat.chunks[k]) for k in range(2)]))

# read only the HDF5 chunks whose row range contains at least one sampled cell,
# keeping the nonzero entries in COO form
data = []
row_inds = []
col_inds = []
for n, (row_slice, col_slice) in enumerate(mat.iter_chunks()):
    if np.logical_and(cells >= row_slice.start, cells < row_slice.stop).any():
        chunk = scipy.sparse.coo_matrix(mat[(row_slice, col_slice)])
        data.append(chunk.data)
        row_inds.append(chunk.row + row_slice.start)
        col_inds.append(chunk.col + col_slice.start)
    if n % 100 == 0:
        sys.stderr.write('{0:.1f}% of {1}k chunks read\r'.format(100*n/tot_chunks, tot_chunks//1000))
sys.stderr.write('{0} chunks read\n'.format(n+1))

# assemble the full sparse matrix, then subset to the sampled cells
sys.stderr.write('constructing DGE\n')
dge = scipy.sparse.csr_matrix((np.concatenate(data),
                               (np.concatenate(row_inds),
                                np.concatenate(col_inds))),
                              shape=mat.shape)

fetal = anndata.AnnData(dge[cells])

@gauravsinghrathore (Author)

But it still does not read the metadata with it, so one will have to match indices at the end.

@bobermayer
Well, you continue as above (though mapping the subcluster IDs is slightly more complicated):

import pandas as pd

fetal.obs_names = np.array(hdf['shoji']['CellID'][cells], dtype=str)
fetal.var_names = np.array(hdf['shoji']['Gene'][:], dtype=str)

# string-valued metadata columns
fetal.obs = pd.DataFrame(dict((k, np.array(hdf['shoji'][k][cells], dtype=str))
                              for k in ['Age', 'CellClass', 'Chemistry',
                                        'Donor', 'Region', 'SampleID', 'Subdivision',
                                        'Subregion', 'Tissue']),
                         index=fetal.obs_names)

# integer-valued cluster assignments
for k in ['Clusters', 'TopLevelCluster']:
    fetal.obs[k] = np.array(hdf['shoji'][k][cells], dtype=int)

fetal.obsm['X_tsne'] = hdf['shoji']['Embedding'][cells]
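From here the subsample is a regular AnnData object, so it can be written to disk as usual (the filename is illustrative):

fetal.write('fetal_subsample.h5ad')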

@gauravsinghrathore (Author)

Fantastic. :)

@yojetsharma

If possible, can you please share the h5ad file that you generated? I don't have the privilege of 1 TB of RAM :/
