Move store backing calculation out of dataset writer #441

For any section that is not the last section, approximately two chunk copies need to be accounted for: - the unpadded chunk that the writer of section `n` creates - the padded chunk that the reader of section `n+1` creates For the last section in a pipeline, only one chunk copy needs to be accounted for: the unpadded chunk that the writer of that section creates.

Prior to this change, the decision of whether the dataset store was backed by RAM or an hdf5 file was made by the writer. Specifically, the writer was deciding whether the store should be backed by RAM or an hdf5 file when the first block was written to the store. This made a lot of sense when the determining of the store-backing was done by: - attempting to create a numpy array with the necessary chunk shape (thus allocating the required memory) - potentially catching a `MemoryError` being raised by python However, some changes were made in #218 to accommodate the fact that, running under SLURM, the job would be OOM killed by SLURM without giving python a chance to raise a `MemoryError`. Among these changes were that, instead of allocating a numpy array and then catching if a `MemoryError` was raised, the required number of bytes was calculated to determine if the numpy array should be created or not. This essentially made the determining of the store backing be an operation purely involving some simple arithmetic. The arithmetic could conceivably be done outside of the writer (ie, prior to the first block being written to the store). One benefit of this would be that the decision of the store backing wouldn't be so hidden as to where it was happening. Furthermore, the requirement of needing to account for two copies of the chunk in memory (see the following comment and the linked thread in the comment for more details #401 (comment)), was difficult to fulfil if the writer was the object deciding what the backing of the store should be (ie, getting the chunk information about the next section, to the writer of the current section, appeared to be tricky). With the above in mind, the calculation of if the store should be backed by RAM or an hdf5 file has been moved out of the writer, and is now performed by the task runner. The runner determines the backing of the store, and passes that information to the writer's constructor. The writer now simply uses whichever backing it is told to use by the runner.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move store backing calculation out of dataset writer #441

Move store backing calculation out of dataset writer #441

Commits on Sep 12, 2024