
Speed up anndata writing after define_clonotype_clusters #556

Merged
7 commits merged into scverse:main on Nov 6, 2024

Conversation

felixpetschko
Collaborator

Writing the anndata object after the define_clonotype_clusters function is too slow; it is currently one of the biggest bottlenecks in scirpy's analysis pipeline. The reason is the clonotype_id -> cell_ids mapping (a dict[str, np.ndarray] of cell-id strings) that gets stored in the anndata.uns attribute. I have now implemented a version where this mapping has the type dict[str, list[str]] and gets converted to a JSON string before it is stored. This is quite fast and very similar to the previous implementation, so only minor changes in the rest of the code were necessary.
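Roughly, the idea looks like this (a minimal sketch; the .uns key, the cell ids, and the variable names are illustrative, not the actual scirpy identifiers):

import json
import numpy as np
import anndata

# hypothetical mapping as produced before this change: clonotype_id -> array of cell ids
clonotype_to_cells = {
    "ct_0": np.array(["AAACCTG-1", "AAAGATG-1"], dtype=object),
    "ct_1": np.array(["AACTCAG-1"], dtype=object),
}

adata = anndata.AnnData()

# new approach: convert the arrays to plain lists and store the whole mapping
# as a single JSON string instead of thousands of small string arrays
adata.uns["clonotype_mapping"] = json.dumps(
    {k: v.tolist() for k, v in clonotype_to_cells.items()}
)
adata.write_h5ad("/tmp/example.h5ad")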

@felixpetschko
Collaborator Author

I made some measurements on my laptop:

[Figure 1: write time of the anndata object vs. runtime of define_clonotype_clusters]
The first figure shows why we have a problem: writing the anndata object takes ~28 times longer than the define_clonotype_clusters call itself.

[Figure 2: write time with the JSON-based approach]
With the JSON approach the storage time drops drastically; for around 100k cells it is roughly on par with the runtime of the function itself.

[Figure 3: speedup of the JSON-based approach over the current implementation]
Here we see the speedup of the new approach.
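The exact benchmark script is not shown in the thread; a minimal sketch of how the two variants could be timed against each other (the dictionary size and file paths are arbitrary):

import json
import time
import numpy as np
import anndata

mapping = {str(i): np.array([str(i)], dtype=object) for i in range(20_000)}

# variant 1: store the dict of small string arrays directly (old behaviour)
adata = anndata.AnnData()
adata.uns["x"] = mapping
start = time.perf_counter()
adata.write_h5ad("/tmp/raw_dict.h5ad")
print(f"dict of arrays: {time.perf_counter() - start:.2f} s")

# variant 2: serialize the mapping to a single JSON string first (new approach)
adata = anndata.AnnData()
adata.uns["x"] = json.dumps({k: v.tolist() for k, v in mapping.items()})
start = time.perf_counter()
adata.write_h5ad("/tmp/json_string.h5ad")
print(f"JSON string:    {time.perf_counter() - start:.2f} s")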

@grst
Collaborator

grst commented Sep 20, 2024

Maybe this can be fixed in anndata directly. There's no good reason for this to be slow. I'll open a ticket.

If they can't or don't want to fix it, then we can go with this workaround.

@grst
Collaborator

grst commented Sep 21, 2024

scverse/anndata#1684

@felixpetschko
Collaborator Author

@grst: according to the discussion in scverse/anndata#1684, it seems that it could take some time to fix this issue on their side.

In the discussion you came up with the following example:

import anndata
import numpy as np

adata = anndata.AnnData()
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")

If we stay with that example, the solution on our side looks like this:

import anndata
import numpy as np
import json

adata = anndata.AnnData()
adata.uns["x"] = json.dumps({str(i): np.array(str(i), dtype="object").tolist() for i in range(20000)})

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad") 

The speedup is quite high - seconds vs. milliseconds. I don't know if they could use something like this on their side (not sure which data types are allowed), otherwise we could use the solution in this PR for now.
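On the read side the mapping then has to be decoded explicitly; a short sketch, assuming the file written in the example above:

import json
import anndata

adata = anndata.read_h5ad("/tmp/anndata.h5ad")
# .uns["x"] holds a single JSON string; decode it to get the dict[str, list[str]] back
mapping = json.loads(adata.uns["x"])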

@grst
Collaborator

grst commented Nov 5, 2024

I agree, fixing this in AnnData would take too long, and the proposed solution only addresses storing as zarr, not h5ad. Let's go with the JSON solution you proposed here, then.

grst marked this pull request as ready for review November 6, 2024 19:22
grst merged commit a9fe4d1 into scverse:main Nov 6, 2024
10 checks passed