Speed up anndata writing speed after define_clonotype_clusters #556
…he anndata result
for more information, see https://pre-commit.ci
Maybe this can be fixed in anndata directly. There's no good reason for this to be slow. I'll open a ticket. If they can't or don't want to fix it, then we can go with this workaround.
@grst: according to the discussion in scverse/anndata#1684, it seems that it could take some time to fix this issue on their side. In the discussion you came up with the following example:
If we stay with that example, the solution on our side looks like this:
The speedup is substantial: seconds vs. milliseconds. I don't know if they could use something like this on their side (I'm not sure which data types are allowed). Otherwise, we could use the solution in this PR for now.
I agree, fixing this in AnnData would take too long, and the proposed solution only addresses storing as zarr but not h5ad. Let's go with the JSON solution you proposed here then.
Writing the anndata object after running define_clonotype_clusters is too slow and is currently one of the biggest bottlenecks in scirpy's analysis pipeline. The cause is the clonotype_id -> cell_ids mapping (dict[str, np.ndarray[str]]) that gets stored in the anndata.uns attribute. I have now implemented a version in which this mapping has the datatype dict[str, list[str]] and is converted to a JSON object before it is stored. This is quite fast and very similar to the previous implementation, so only minor changes to the rest of the code were necessary.
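As a rough illustration of the idea described above (not the actual code from this PR), the mapping can be turned into a single JSON string before it is stored, so that the writer only has to handle one string instead of many small arrays. The function names, the `cell_indices` key, and the plain dict standing in for `adata.uns` are all assumptions made for this sketch.

```python
import json


def encode_clonotype_mapping(mapping):
    """Serialize a clonotype_id -> cell_ids mapping to one JSON string.

    Hypothetical helper: str() is applied to every key and cell id so that
    numpy string scalars (if the values are arrays) become plain Python str.
    """
    return json.dumps({str(k): [str(c) for c in v] for k, v in mapping.items()})


def decode_clonotype_mapping(payload):
    """Restore the dict[str, list[str]] mapping from its JSON string."""
    return json.loads(payload)


# Stand-in for adata.uns; the key name "cell_indices" is an assumption.
uns = {}
mapping = {"0": ["cell_a", "cell_b"], "1": ["cell_c"]}

uns["cell_indices"] = encode_clonotype_mapping(mapping)
restored = decode_clonotype_mapping(uns["cell_indices"])
```

Code that previously read the mapping directly would need to decode the string first, which is the kind of minor change the description refers to.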