
Speed up anndata writing after define_clonotype_clusters #556

Merged
7 commits merged into scverse:main on Nov 6, 2024

Conversation

felixpetschko
Collaborator

Writing the anndata object after the define_clonotype_clusters function is too slow; it is currently one of the biggest bottlenecks in scirpy's analysis pipeline. The reason is the clonotype_id -> cell_ids mapping (a dict[str, np.ndarray] of cell-id strings) that gets stored in the anndata.uns attribute. I have now implemented a version where this mapping has the type dict[str, list[str]] and gets converted to a JSON string before it is stored. This is quite fast and very similar to the previous implementation, so only minor changes in the rest of the code were necessary.
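Roughly, the idea looks like this (a minimal sketch; the .uns key, the cell ids, and the variable names are illustrative, not the actual scirpy identifiers):

import json
import numpy as np
import anndata

# hypothetical mapping as produced before this change: clonotype_id -> array of cell ids
clonotype_to_cells = {
    "ct_0": np.array(["AAACCTG-1", "AAAGATG-1"], dtype=object),
    "ct_1": np.array(["AACTCAG-1"], dtype=object),
}

adata = anndata.AnnData()

# new approach: convert the arrays to plain lists and store the whole mapping
# as a single JSON string instead of thousands of small string arrays
adata.uns["clonotype_mapping"] = json.dumps(
    {k: v.tolist() for k, v in clonotype_to_cells.items()}
)
adata.write_h5ad("/tmp/example.h5ad")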

@felixpetschko
Collaborator Author

I made some measurements on my laptop:

[Figure 1: write time of the anndata object vs. runtime of define_clonotype_clusters]
The first figure shows why we have a problem: writing the anndata object takes ~28 times longer than the define_clonotype_clusters call itself.

[Figure 2: write time with the JSON-based approach]
With the JSON approach the storage time drops drastically; for around 100k cells it is roughly on par with the runtime of the function itself.

[Figure 3: speedup of the JSON-based approach over the current implementation]
Here we see the speedup of the new approach.
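The exact benchmark script is not shown in the thread; a minimal sketch of how the two variants could be timed against each other (the dictionary size and file paths are arbitrary):

import json
import time
import numpy as np
import anndata

mapping = {str(i): np.array([str(i)], dtype=object) for i in range(20_000)}

# variant 1: store the dict of small string arrays directly (old behaviour)
adata = anndata.AnnData()
adata.uns["x"] = mapping
start = time.perf_counter()
adata.write_h5ad("/tmp/raw_dict.h5ad")
print(f"dict of arrays: {time.perf_counter() - start:.2f} s")

# variant 2: serialize the mapping to a single JSON string first (new approach)
adata = anndata.AnnData()
adata.uns["x"] = json.dumps({k: v.tolist() for k, v in mapping.items()})
start = time.perf_counter()
adata.write_h5ad("/tmp/json_string.h5ad")
print(f"JSON string:    {time.perf_counter() - start:.2f} s")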

@grst
Collaborator

grst commented Sep 20, 2024

Maybe this can be fixed in anndata directly. There's no good reason for this to be slow. I'll open a ticket.

If they can't or don't want to fix it, then we can go with this workaround.

@grst
Collaborator

grst commented Sep 21, 2024

scverse/anndata#1684

@felixpetschko
Collaborator Author

@grst: according to the discussion in scverse/anndata#1684, it seems that it could take some time to fix this issue on their side.

In the discussion you came up with the following example:

import anndata
import numpy as np

adata = anndata.AnnData()
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")

If we stay with that example, the solution on our side looks like this:

import anndata
import numpy as np
import json

adata = anndata.AnnData()
adata.uns["x"] = json.dumps({str(i): np.array(str(i), dtype="object").tolist() for i in range(20000)})

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad") 

The speedup is quite high - seconds vs. milliseconds. I don't know if they could use something like this on their side (not sure which data types are allowed), otherwise we could use the solution in this PR for now.
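On the read side the mapping then has to be decoded explicitly; a short sketch, assuming the file written in the example above:

import json
import anndata

adata = anndata.read_h5ad("/tmp/anndata.h5ad")
# .uns["x"] holds a single JSON string; decode it to get the dict[str, list[str]] back
mapping = json.loads(adata.uns["x"])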

@grst
Collaborator

grst commented Nov 5, 2024

I agree, fixing this in AnnData would take too long, and the proposed solution only addresses storing as zarr, not h5ad. Let's go with the JSON solution you proposed here, then.

grst marked this pull request as ready for review November 6, 2024 19:22
grst merged commit a9fe4d1 into scverse:main Nov 6, 2024
10 checks passed