Hello,
I have a dataset ds with ~60k rows and ~3 million columns. I'd like to retrieve certain columns (at most 100 at once), but the indexing is far too slow (e.g. ds[:, list_with_100_random_indices]). What is the recommended way to sample data efficiently from the dataset? Alternatively, is there a workaround (perhaps not using loompy)? This would be really useful for training machine learning models.
Thank you,
Ramon
Hi - yes, that's a weakness of the HDF5 format. Under the hood, the data is stored in chunks (I think we use 64x64), so loading 100 random columns really means reading on the order of 6400 columns' worth of data.
One suggestion would be to first permute the columns (using the permute() method and a random permutation vector) and then to sample sets of adjacent columns. The permutation will take a long time, but you do it only once.
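In case it helps, here is a minimal sketch of that idea, assuming a local file named data.loom (a placeholder) and loompy's connect()/permute() API. permute() rewrites the data in the file, so keep a backup before running it:

import numpy as np
import loompy

# One-time shuffle of the columns (slow, but done only once).
with loompy.connect("data.loom") as ds:
    ordering = np.random.permutation(ds.shape[1])
    ds.permute(ordering, axis=1)

# Afterwards, any contiguous block of columns is already a random sample,
# and contiguous reads line up well with the HDF5 chunking.
with loompy.connect("data.loom") as ds:
    batch_size = 100
    start = np.random.randint(0, ds.shape[1] - batch_size)
    batch = ds[:, start:start + batch_size]  # ~60k x 100 dense array

The second loop can be repeated with different start offsets to draw as many minibatches as needed for training.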