Hello,
I have a dataset ds with ~60k rows and ~3 million columns. I'd like to retrieve certain columns (at most 100 at once), but the indexing is far too slow (e.g. ds[:, list_with_100_random_indices]). What is the recommended way to sample data efficiently from the dataset? Alternatively, is there a workaround (perhaps not using loompy)? This would be really useful for training machine learning models.
Thank you,
Ramon
Hi - yes, that's a weakness of the HDF5 format. Under the hood, the data is stored in chunks (I think we use 64x64), so loading 100 random columns really means reading on the order of 6400 columns' worth of data.
One suggestion would be to first permute the columns (using the permute() method and a random permutation vector) and then to sample sets of adjacent columns. The permutation will take a long time, but you do it only once.
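In case it helps, here is a minimal sketch of that idea, assuming a local file named data.loom (a placeholder) and loompy's connect()/permute() API. permute() rewrites the data in the file, so keep a backup before running it:

import numpy as np
import loompy

# One-time shuffle of the columns (slow, but done only once).
with loompy.connect("data.loom") as ds:
    ordering = np.random.permutation(ds.shape[1])
    ds.permute(ordering, axis=1)

# Afterwards, any contiguous block of columns is already a random sample,
# and contiguous reads line up well with the HDF5 chunking.
with loompy.connect("data.loom") as ds:
    batch_size = 100
    start = np.random.randint(0, ds.shape[1] - batch_size)
    batch = ds[:, start:start + batch_size]  # ~60k x 100 dense array

The second loop can be repeated with different start offsets to draw as many minibatches as needed for training.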