Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend benchmarks to zarr v3 #14

Open
falexwolf opened this issue Nov 5, 2024 · 4 comments
Open

Extend benchmarks to zarr v3 #14

falexwolf opened this issue Nov 5, 2024 · 4 comments
Assignees

Comments

@falexwolf
Copy link
Member

From a discussion with @ilan-gold (Slack ref).

@Koncopd, let's add this. Ilan is not in a particular rush though.

@falexwolf falexwolf changed the title Run dense array benchmark against zarr v3 Extend benchmarks to zarr v3 Dec 9, 2024
@falexwolf
Copy link
Member Author

Following up here with more detailed suggestions rom @ilan-gold regarding benchmark zarr v3 (Slack ref).

So the way I would do it is:

  • Add a benchmark for the single dataset case, using zarrv3 both with and without our zarrs extensions (https://github.com/ilan-gold/zarrs-python) using both the read_elem_as_dask API and using sparse_dataset API
  • For the multi-benchmark case, use the concat API to make one giant dataset all loaded with X form read_elem_as_dask, tested with and without zarrs extension

The branch is here: scverse/anndata#1726 for zarr v3. For writing out synthetic data, I would only write out the X matrix using ad.io.write_elem since I think string handling is a bit iffy (although you can try). Similarly for reading in the dataset I would only do something like AnnData(X=read_elem_as_dask("path_to.zarr")) . I don't think the indices make a huge difference but maybe I'm wrong.

We'll be improving the zarrs performance but I think it's in a good state. I'm seeing about a 2X boost in general over zarrv3, which should already be a big perf improvement. Also, I noticed you don't clear the disk cache in scripts or missed it. That might be skewing results!

@ilan-gold, can you clarify whether any of this is for streaming from S3 or is it all for local datasets?

@ilan-gold
Copy link
Contributor

It is only for local: ilan-gold/zarrs-python#44.

We also probably should make things truly async before testing the HTTP stuff but we can if you'd like once the PR is merged

@ilan-gold
Copy link
Contributor

ilan-gold commented Dec 9, 2024

Also it is only sparse. I am going to open a PR for integer indexing soon on dense, but it will not be done until next year due to the holiday

@falexwolf
Copy link
Member Author

Got it! 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants