Opening HDF5 files in Lua takes a long time #112

Open
Amir-Arsalan opened this issue Apr 25, 2019 · 7 comments

Comments

@Amir-Arsalan

Amir-Arsalan commented Apr 25, 2019

I created some HDF5 files with the h5py package in Python and load them in Lua using this package. The files range in size from 10 GB to 100 GB or more, and opening them in Python is instantaneous. In Lua, however, calling hdf5.open() on the same files takes a very long time, before I have read any data. Opening a single file sometimes takes a minute or more, whereas Python opens the same file in less than half a second.

I wonder if anyone has had this issue before?

@Amir-Arsalan
Author

@d11 I would appreciate it if you could give me a clue about what to look into to fix this. Opening the HDF5 files is unbelievably slow. A Python script that does exactly the same thing opens all of my files in less than 0.1 seconds.

@Amir-Arsalan
Author

@d11 Does this package make any assumptions about how the data sets are created? I don't do any chunking or anything like that on the Python side when I create the data set ...

@d11
Contributor

d11 commented Apr 29, 2019

Hi, I'm afraid I'm not sure about this. This package does traverse the file when opening it, to determine the whole structure up front; perhaps even that is too slow in your case. The properly scalable approach may be for this to be lazier, but that was not necessary in our usage. In general you should know that torch-hdf5 is not as mature as h5py; while in principle HDF5 itself works fine for large datasets, this library has mainly been used for transferring smaller amounts of data between languages / programs in a convenient way. If you need to get around this, I'd start by disabling the _loadObject call that is triggered when opening the file. The library won't work without it, but you can see whether opening becomes fast, which would confirm that the traversal of the file is the cause of the slowness. Actually using the library without doing this up front, however, might require more invasive changes.
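
As a rough illustration of the difference between traversing everything on open and resolving members lazily, here is a toy model in plain Python. It is not torch-hdf5's code and does not touch HDF5 at all: `FakeFile`, `open_eager`, and `LazyGroup` are hypothetical names, and the `lookups` counter merely stands in for the disk accesses a real library would make.

```python
class FakeFile:
    """Stand-in for an on-disk hierarchy: maps a group path to its child names."""

    def __init__(self, tree):
        self.tree = tree
        self.lookups = 0  # counts simulated disk lookups

    def children(self, path):
        self.lookups += 1
        return self.tree.get(path, [])


def open_eager(f, path="/"):
    """Walk the whole hierarchy up front: one lookup per object in the file."""
    return {
        name: open_eager(f, path.rstrip("/") + "/" + name)
        for name in f.children(path)
    }


class LazyGroup:
    """Resolve a group's children only when a member is first requested."""

    def __init__(self, f, path="/"):
        self.f = f
        self.path = path
        self._names = None  # child names are not known until first access

    def __getitem__(self, name):
        if self._names is None:
            self._names = self.f.children(self.path)
        if name not in self._names:
            raise KeyError(name)
        return LazyGroup(self.f, self.path.rstrip("/") + "/" + name)


tree = {"/": ["a", "b"], "/a": ["x", "y"], "/b": ["z"]}

eager_file = FakeFile(tree)
structure = open_eager(eager_file)
print(eager_file.lookups)  # 6: every object is visited before any data is read

lazy_file = FakeFile(tree)
node = LazyGroup(lazy_file)["a"]
print(lazy_file.lookups)   # 1: only the root group was inspected
```

In this model the eager cost grows with the number of objects in the file, while the lazy cost grows only with the depth of the path actually accessed. Whether that matches the real bottleneck here is exactly what disabling the up-front traversal would help confirm.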

@Amir-Arsalan
Author

@d11 Thank you for the information. Do you think I should comment out these lines?

@d11
Contributor

d11 commented Apr 29, 2019

Right, that would skip the initial traversal of the dataset that I mentioned. The library will not work without it, but it might at least confirm the cause of the problem.

@Amir-Arsalan
Author

@d11 I think there is more to fix. Not only is opening the file very slow, but reading data is also very slow. Do you have any idea why reading data might be slow as well?

@d11
Contributor

d11 commented May 3, 2019

I don't really have any guesses about that, sorry. Perhaps the dataspace used by torch-hdf5 is not suitable for your access patterns. https://support.hdfgroup.org/HDF5/doc/UG/HDF5_Users_Guide-Responsive%20HTML5/HDF5_Users_Guide/Dataspaces/HDF5_Dataspaces_and_Partial_I_O.htm?rhtocid=7.2#TOC_7_4_Dataspaces_and_Databc-6 describes how this can work in HDF5. If this is crucial for your performance, you will probably need to use a lower-level interface than torch-hdf5.
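
To make the point about access patterns concrete, the sketch below estimates how many separate contiguous byte runs a 2-D selection needs when a data set is stored without chunking, that is, as one row-major block. It is plain Python arithmetic, not an HDF5 or torch-hdf5 call; `contiguous_runs` is a hypothetical helper, and it assumes positive slice steps and ignores any caching the library or the OS may do.

```python
def contiguous_runs(shape, row_slice, col_slice):
    """Count the contiguous byte runs needed to read a 2-D selection
    from a row-major array stored as a single unchunked block."""
    n_rows, n_cols = shape
    rows = range(*row_slice.indices(n_rows))
    cols = range(*col_slice.indices(n_cols))
    if len(rows) == 0 or len(cols) == 0:
        return 0
    full_width = len(cols) == n_cols and cols.step == 1
    if full_width and rows.step == 1:
        return 1  # adjacent whole rows form a single run
    if cols.step == 1:
        return len(rows)  # one run per selected row
    return len(rows) * len(cols)  # strided columns: one run per element


shape = (1000, 500)
print(contiguous_runs(shape, slice(0, 100), slice(None)))       # 1
print(contiguous_runs(shape, slice(None), slice(0, 1)))         # 1000
print(contiguous_runs(shape, slice(0, 10), slice(0, 500, 2)))   # 2500
```

Reading a block of whole rows touches one run, while reading a single column touches one run per row. If the slow reads follow the second pattern, the storage layout chosen when writing the file may matter as much as the reading library.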
