
unable to read hdf file in python #3

Open · gauravsinghrathore opened this issue Nov 2, 2022 · 7 comments
@gauravsinghrathore commented Nov 2, 2022

[Three screenshots attached showing the failed read attempts]
Dear Linnarsson lab,
Thank you very much for publishing this fantastic and precious dataset.
However, I am struggling to read the posted h5 file with both h5py and scanpy. It looks like the problem lies in the key of the h5 file (which I presume is 'shoji').

@slinnarsson (Contributor)

Under shoji there are many keys. Try this:

hdf["shoji"].keys()
hdf["shoji"]["CellID"][:10]
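If it helps, here is a minimal sketch for listing everything stored in the file (assuming the file is already open as hdf and that every entry under 'shoji' is a dataset):

for name, dset in hdf["shoji"].items():
    # print each dataset's name, shape, and dtype
    print(name, dset.shape, dset.dtype)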

@gauravsinghrathore (Author) commented Nov 2, 2022

Thanks for the swift reply. :) This works. Since this is an ultra-large dataset, I now see why loom is not helpful here.

I wrote a quick workaround to extract the data from the h5 file and build a scanpy-compatible AnnData. But I suggest having at least 1 TB of RAM.
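A quick way to sanity-check that requirement before loading anything (a minimal sketch; h5py exposes shape and dtype without reading the data):

import h5py
import numpy as np

with h5py.File('HumanFetalBrainPool.h5', 'r') as f:
    dset = f['shoji']['Expression']
    # dense in-memory size = rows x cols x bytes per element
    print(np.prod(dset.shape) * dset.dtype.itemsize / 1e9, 'GB')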

### load basic Python libraries ###

import numpy as np
import h5py
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

### read the hdf file using h5py ###

hdf = h5py.File('/scratch/kgr851/datasets/human_brain_development/linnarsson/HumanFetalBrainPool.h5','r')

### I think 'Expression' is the count matrix ###

### make adata, then add other metadata as adata.obs ###

expr = hdf['shoji']['Expression'][:1665937]
a = sc.AnnData(expr)

### extract cell names and gene names and add to adata ###

genes = np.array(hdf['shoji']['Gene'][:], dtype=str)
cells = np.array(hdf['shoji']['CellID'][:1665937], dtype=str)
a.obs_names = cells
a.var_names = genes

### read the columns you are interested in to add as obs ###
### (remember to keep it small and equal in length to the number of cells; you can also loop it) ###
a.obs['CellID'] = np.array(hdf['shoji']['CellID'][:1665937], dtype=str)
a.obs['CellClass'] = np.array(hdf['shoji']['CellClass'][:1665937], dtype=str)
a.obs['Chemistry'] = np.array(hdf['shoji']['Chemistry'][:1665937], dtype=str)
a.obs['Tissue'] = np.array(hdf['shoji']['Tissue'][:1665937], dtype=str)
a.obs['Subregion'] = np.array(hdf['shoji']['Subregion'][:1665937], dtype=str)
a.obs['Age'] = np.array(hdf['shoji']['Age'][:1665937], dtype=str)
a.obs.head(20)

### time to add the embedding ###
embedding = hdf['shoji']['Embedding'][:1665937]
a.obsm['X_embedding'] = embedding

with plt.rc_context({"figure.figsize": [10, 10]}):
    sc.pl.embedding(a, basis="X_embedding", color=["Age"], frameon=False, size=2)

### adata is ready; you can save it in h5ad format ###

a.write('/scratch/kgr851/datasets/human_brain_development/linnarsson/hbd_atlas.h5ad')

### you can also save it as a loom file ###
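For example, using AnnData's built-in loom writer (requires loompy; the filename is just a placeholder):

a.write_loom('hbd_atlas.loom')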

@bobermayer
Here's another way of reading the gene expression matrix chunk by chunk for a (random) subsample of cells, avoiding the creation of large dense arrays:

import sys
import numpy as np
import scanpy as sc
import h5py
import anndata
import scipy.sparse

hdf = h5py.File('HumanFetalBrainPool.h5', 'r')
mat = hdf['shoji']['Expression']

# pick roughly ncells random cells by thresholding uniform samples per row
ncells = 100000
cells = np.where(np.random.sample(mat.shape[0]) < ncells/mat.shape[0])[0]
tot_chunks = int(np.prod([np.ceil(mat.shape[k]/mat.chunks[k]) for k in range(2)]))

# read only the HDF5 chunks whose row range contains at least one sampled cell,
# keeping the nonzero entries in COO form
data = []
row_inds = []
col_inds = []
for n, (row_slice, col_slice) in enumerate(mat.iter_chunks()):
    if np.logical_and(cells >= row_slice.start, cells < row_slice.stop).any():
        chunk = scipy.sparse.coo_matrix(mat[(row_slice, col_slice)])
        data.append(chunk.data)
        row_inds.append(chunk.row + row_slice.start)
        col_inds.append(chunk.col + col_slice.start)
    if n % 100 == 0:
        sys.stderr.write('{0:.1f}% of {1}k chunks read\r'.format(100*n/tot_chunks, tot_chunks//1000))
sys.stderr.write('{0} chunks read\n'.format(n+1))

# assemble the full sparse matrix, then subset to the sampled cells
sys.stderr.write('constructing DGE\n')
dge = scipy.sparse.csr_matrix((np.concatenate(data),
                               (np.concatenate(row_inds),
                                np.concatenate(col_inds))),
                              shape=mat.shape)

fetal = anndata.AnnData(dge[cells])

@gauravsinghrathore (Author)

But it still does not read the metadata with it, so one will have to match indices at the end.

@bobermayer
Well, you continue as above (though mapping the subcluster IDs is slightly more complicated):

import pandas as pd

fetal.obs_names = np.array(hdf['shoji']['CellID'][cells], dtype=str)
fetal.var_names = np.array(hdf['shoji']['Gene'][:], dtype=str)

# string-valued metadata columns
fetal.obs = pd.DataFrame(dict((k, np.array(hdf['shoji'][k][cells], dtype=str))
                              for k in ['Age', 'CellClass', 'Chemistry',
                                        'Donor', 'Region', 'SampleID', 'Subdivision',
                                        'Subregion', 'Tissue']),
                         index=fetal.obs_names)

# integer-valued cluster assignments
for k in ['Clusters', 'TopLevelCluster']:
    fetal.obs[k] = np.array(hdf['shoji'][k][cells], dtype=int)

fetal.obsm['X_tsne'] = hdf['shoji']['Embedding'][cells]
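From here the subsample is a regular AnnData object, so it can be written to disk as usual (the filename is illustrative):

fetal.write('fetal_subsample.h5ad')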

@gauravsinghrathore (Author)

Fantastic. :)

@yojetsharma

If possible, can you please share the h5ad file that you generated? I don't have the privilege of 1 TB of RAM :/
