Mismatch between cell metadata and expression matrix #28

Open
rwollman opened this issue Sep 17, 2023 · 3 comments

@rwollman

In the 20230830 release, there is a mismatch in the number of cells between the expression matrix and metadata for the Allen MERFISH data. Metadata has 3938808 cells, and the expression matrix has 4334174 cells.

Metadata was loaded with:
import os
import pandas as pd
download_base = '/orangedata/ExternalData/Allen_WMB_2023Sep05'
rpath = metadata['cell_metadata']['files']['csv']['relative_path']
file = os.path.join(download_base, rpath)
cell = pd.read_csv(file, dtype={"cell_label": str})
cell.shape

Expression was loaded with:
import anndata
filename = expression_matrices['C57BL6J-638850']['raw']['files']['h5ad']['relative_path']
adata = anndata.read_h5ad(os.path.join(download_base, filename))
adata.shape

Both of these numbers differ from the 20230630 release, where both datasets had the same number of cells (4330907).

If the cell numbers are not the same, the spatial data becomes useless, as you can't match cells to their xy positions. For example, I suspect that the notebooks merfish_tutorial_1, 2a, and 2b show inaccurate maps of gene expression due to this issue (depending on how the filtered cells are distributed across sections).

@tmchartrand

I can't explain the number mismatch, but expect it's due to changes in some QC criteria - maybe @mkunst23 can?
Just to note though, this is not an issue for using the remaining data as long as you join the anndata and metadata properly using the cell IDs.
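
A minimal sketch of such a join, using toy stand-ins for `adata.obs_names` and the `cell` DataFrame from the snippets above. It assumes the expression matrix's observation index holds the same `cell_label` strings as the metadata column; `shared_cell_labels` is a hypothetical helper, not part of any release tooling.

```python
import pandas as pd

def shared_cell_labels(obs_names, cell, label_col="cell_label"):
    """Return the labels found in both the matrix and the metadata, in matrix order."""
    meta_labels = pd.Index(cell[label_col].astype(str))
    obs_names = pd.Index(obs_names).astype(str)
    return obs_names[obs_names.isin(meta_labels)]

# Toy stand-ins: four cells in the matrix, three of them in the metadata
obs_names = pd.Index(["c1", "c2", "c3", "c4"])
cell = pd.DataFrame({
    "cell_label": ["c4", "c2", "c1"],
    "x": [0.4, 0.2, 0.1],
    "y": [1.4, 1.2, 1.1],
})

shared = shared_cell_labels(obs_names, cell)
# Reorder the metadata rows so they line up with the matrix rows
cell_aligned = cell.set_index("cell_label").loc[shared]

# With the real objects (assumption: adata.obs_names are cell labels):
#   adata = adata[shared].copy()
#   adata.obs = adata.obs.join(cell_aligned)
```

Selecting by label rather than by row position keeps each cell paired with its own coordinates even when the two files differ in length or order.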

@rwollman
Author

Thanks, you are correct that I can avoid this with a proper merge. My bad and thanks for pointing this out.

@mkunst23

Yes, 4334174 is the cell count before filtering out cells with low average correlation scores (<0.5) when mapped against the reference taxonomy.
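
With the counts reported in this thread, that filter would account for 4334174 - 3938808 = 395366 cells, assuming every cell in the metadata is also present in the matrix. A rough sketch of how one might list the removed cells, shown on toy labels; `dropped_cell_labels` is a hypothetical helper.

```python
import pandas as pd

def dropped_cell_labels(obs_names, meta_labels):
    """Labels present in the expression matrix but absent from the metadata."""
    return pd.Index(obs_names).difference(pd.Index(meta_labels))

# Counts reported in this thread for the 20230830 release
n_matrix, n_metadata = 4334174, 3938808
# Holds only if the metadata cells are a subset of the matrix cells (assumption)
expected_dropped = n_matrix - n_metadata

# Toy stand-ins for adata.obs_names and cell["cell_label"]
obs_names = ["c1", "c2", "c3", "c4"]
meta_labels = ["c4", "c2", "c1"]
dropped = dropped_cell_labels(obs_names, meta_labels)
```

On the real data, comparing `len(dropped)` with `expected_dropped` is a quick way to check the subset assumption.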
