
Improve intermediate file performance #271

Merged
merged 15 commits into main from improve-intermediate-file-performance on Aug 12, 2024

Conversation

yousefmoazzam
Collaborator

@yousefmoazzam commented Apr 5, 2024

How the benchmarking runs were organised

Various runs on the production SLURM cluster at DLS were performed using

  • a fixed hardware configuration
  • a single pipeline being applied to a specific dataset

and varying the parameters relevant to intermediate data saving

Hardware

The fixed hardware configuration was as follows:

  • 1 node
  • 4 GPUs per node
  • NVIDIA V100 GPUs

Pipeline

The fixed pipeline was https://github.com/dkazanc/dls_pipelines/blob/main/pipelines/bench_pipeline_gpu_intense_separate_rescale.yaml.

Data

The fixed dataset was "the sandstone data" 119647.nxs (20GB).

Parameters investigated

The parameters that were varied were:

  • intermediate file format: hdf5 or zarr
  • unchunked vs. chunked (the chunk shape was chosen as 1 projection/sinogram, depending on the pattern of the method producing the intermediate data; see the sketch after this list)
  • uncompressed vs. compressed (only BLOSC compression was used)
  • running httomo with the --save-all flag or not
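As a concrete illustration of the chunking rule above, here is a minimal sketch of how a chunk shape of 1 projection/sinogram could be chosen (the function name and the (angles, detector_y, detector_x) layout are assumptions for illustration, not httomo's actual code):

    def choose_chunk_shape(shape: tuple, pattern: str) -> tuple:
        """Pick one chunk per projection or per sinogram, depending on the
        pattern of the method that produced the intermediate data."""
        angles, det_y, det_x = shape
        if pattern == "projection":
            return (1, det_y, det_x)   # one projection per chunk
        return (angles, 1, det_x)      # one sinogram per chunk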

Times

The pipeline execution times are listed from fastest to slowest, grouped by whether the --save-all flag was used.

The full time taken for the pipeline to execute was taken from the logfile that httomo creates.

Without --save-all

  1. zarr: chunked + uncompressed = 149s
  2. zarr: chunked + compressed = 168s
  3. hdf5: chunked + uncompressed = 243s
  4. hdf5: unchunked + uncompressed = 353s
  5. zarr: unchunked + uncompressed = 430s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)

With --save-all

  1. zarr: chunked + uncompressed = 275s
  2. zarr: chunked + compressed = 342s
  3. hdf5: chunked + uncompressed = 794s
  4. hdf5: unchunked + uncompressed = 1148s
  5. zarr: unchunked + uncompressed = 2233s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)

Concluding remarks

  • hdf5 intermediate data saving can be faster by simply writing the data chunked:
    • without --save-all: 243s (chunked+uncompressed) vs. 353s (unchunked+uncompressed) = 110s speedup (243/353 * 100 = 68.8% of original time, so ~30% decrease in time taken)
    • with --save-all: 794s (chunked+uncompressed) vs. 1148s (unchunked+uncompressed) = 354s speedup (794/1148 * 100 = 69.1% of original time, so ~30% decrease in time taken)
  • zarr chunked + uncompressed seems to be the fastest
  • hdf5 chunked + uncompressed could possibly be made faster with further investigation: the nsys profile report shows a block of ~70s where it is unclear what the CPU is doing, other than that it is in some sort of blocking call that possibly involves MPI synchronisation
  • hdf5 chunked + compressed consistently produces a segfault when writing data, but only for the output of FBP (the outputs of other methods can be written as hdf5 chunked + compressed), so something strange may be going on there and it is worth extra investigation
  • this is my first time working with zarr, so take this with a grain of salt: writing data in the zarr file format with any kind of non-trivial hierarchical structure from parallel processes does not yet seem fully mature; there are synchronisation issues around group creation which can be tricky for a newcomer to work around, see Syncrhonization issue when creating groups in parallel zarr-developers/zarr-python#658

@yousefmoazzam force-pushed the improve-intermediate-file-performance branch from bbee3ee to 27e9cca on April 8, 2024 14:25
@yousefmoazzam
Collaborator Author

yousefmoazzam commented Apr 8, 2024

More info on hdf5 chunked + uncompressed

Viewing the nsys profile reports in nsight reveals that the ~70s "wait" before writing some blocks to intermediate data first occurs after the first execution of the paganin filter method (taking ~59s). It then occurs several times in the first block being processed in the section containing FBP (taking ~70s each time).

More specifically, the stripe removal method's output being written in that section doesn't cause the wait, but

  • FBP's output causes it
  • median filter's output causes it
  • calculate stats' output causes it

(all only on the first block iteration within that section); see the screenshot below

[screenshot: hdf5-chunked-save-all-waits]

Info that could help understand why this is happening

Regarding my comment about this possibly being related to MPI synchronisation: this was because function calls mentioning mutexes appear in the region where the waiting/blocking occurs:
[screenshot: hdf5-chunk-save-all-message-1]

Another interesting function call that seems to appear early on in every instance where the waiting/blocking occurs mentions "memcpy" and "unaligned":
[screenshot: hdf5-chunk-save-all-message-2]

The name of the region being "vdso" also seems noteworthy. At first there was a suspicion it was referring to "VDS (virtual datasets)", a feature of hdf5. However, this feature in hdf5 isn't being used when writing data in httomo. Given that some of the functions in that region appear to be system calls, the latest speculation is that maybe it's referring to "vDSO" that is part of some binary executables https://www.man7.org/linux/man-pages/man7/vdso.7.html

Note that `--compress-intermediate` has been configured to imply
`--chunk-intermediate`. This is because if compression is applied
without also triggering the code to specify the chunk shape to be 1
chunk per projection/sinogram, then `h5py` will guess the chunk shape to
write. This is often not as optimal as choosing 1 chunk per
projection/sinogram, in terms of compression ratio.
Creation of `Group` objects via `zarr.open()` and `zarr.open_group()`
suffers from a race condition: there is no synchronisation between
processes when attempting to initialise group metadata.

Therefore, `Group` creation via `open_group()` has been replaced in
favour of:
- the rank 0 process being the only one to create any groups (see
  auxiliary data creation code changes)
- passing a `DirectoryStore` object to `zarr.open_array()`, which does
  not exhibit the same synchronisation issues
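For illustration, a hedged sketch of the workaround described above (zarr v2-style API and mpi4py are assumed; the store path, dataset path, shape, chunks and dtype are placeholders, not httomo's actual values):

    import zarr
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    store = zarr.DirectoryStore("intermediate.zarr")

    # only rank 0 creates/initialises group metadata, avoiding the race condition
    if comm.rank == 0:
        zarr.open_group(store, mode="a")
    comm.Barrier()

    # opening the array via an explicit DirectoryStore sidesteps the
    # group-creation synchronisation issue mentioned above
    arr = zarr.open_array(store, path="data", mode="a",
                          shape=(1801, 2160, 2560), chunks=(1, 2160, 2560),
                          dtype="float32")
    # each rank then writes its own non-overlapping slab, e.g. arr[start:stop] = block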
@yousefmoazzam force-pushed the improve-intermediate-file-performance branch from 27e9cca to 18a3b60 on April 10, 2024 10:15
@dkazanc changed the base branch from feature/transparent-file-store to main on June 13, 2024 15:23
@team-gpu
Contributor

team-gpu commented Jul 15, 2024

A few comments here (without investigating deeper yet):

  • for HDF5 chunking, I think the following should be considered:

    • The official recommendation is to keep chunks below 1 MB in size. In this PR you use one chunk per slice, which of course makes it easier to handle, but that might be bigger than 1 MB for this dataset?
    • the chunks are used for caching (also see here) and also have an effect on compression (which might explain the seg fault). But caching is not an issue for intermediate files, as we don't read them and do a pure write, once-only.
    • I believe these test runs should be extended to experiment with different HDF5 chunk sizes, as chunking definitely had a positive effect.
  • the pthread mutex waits / condition variable waits look like they are similar to this issue (although different context). The conclusion there was a CUDA driver issue - and somewhere in the thread they mention something about different streams used by different threads in the same process. I think for httomo, the former is more likely and we should try to re-run with a newer driver and confirm if things are the same / reproducible.

  • If the driver does not resolve this, we'll need to dig deeper with other profiles to find out where exactly the delay is happening, how, and why...

  • one possibility (considered due to the unaligned memcpy that seems to happen) is to explicitly manage the host memory used for the data when GPU->CPU copies happen, and make sure it's aligned and also pinned (page-locked, not pageable); it is very likely faster to do file I/O from pinned memory, but this requires deeper CPU/Python profiling to confirm (see the sketch below)
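Regarding the last point, a hedged sketch of what explicitly managed, pinned host memory could look like (CuPy is assumed, since httomo's GPU methods use it; this is an illustration of the idea, not a measured or proposed implementation):

    import numpy as np
    import cupy as cp

    def to_pinned_host(gpu_arr: cp.ndarray) -> np.ndarray:
        """Copy a device array into page-locked (pinned, page-aligned) host memory."""
        mem = cp.cuda.alloc_pinned_memory(gpu_arr.nbytes)
        host = np.frombuffer(mem, dtype=gpu_arr.dtype,
                             count=gpu_arr.size).reshape(gpu_arr.shape)
        gpu_arr.get(out=host)  # device -> pinned host copy
        return host            # file I/O from this buffer avoids pageable memory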

@dkazanc
Collaborator

dkazanc commented Jul 15, 2024

Hi @ptim0626, please see the comment from @team-gpu above; it could be helpful. It is worth trying to reduce the chunk size to see if that fixes the compression-related segfault. The point about the driver and the pthread mutex wait is also interesting.
In this PR I'll also play with smaller chunk sizes, as anything larger than 1 was leading to a segfault; probably the chunk was too big.

@ptim0626
Collaborator

Hi @dkazanc, thanks for the info! Actually, I think I just found out the reason for the segfault with compressed chunks (literally just a few hours ago!). I am doing some reading to make sure I understand it better.

In short, if you export OMPI_MCA_fcoll=dynamic the segfault goes away. What I haven't tested is the performance and the pthread issue, but I just want to get compressed chunks working for now.

@ptim0626
Collaborator

FYI I opened a separate issue for tracking the compressed chunk segmentation fault #389

@ptim0626
Collaborator

I think I am able to reproduce the wait.

Looking at my profile report, I guess the wait relates to the creation of chunks (a lot of H5D_chunk_allocate). In the serial case, chunks are allocated incrementally (when they are actually written); in the parallel case, the chunks are allocated when the dataset is created [1]. (Even if the function _save_dataset_data is commented out, the wait is still there.)

I have been trying to turn off this behaviour; so far I have tried:

However, the wait is still there... I will continue to investigate in this direction. I could be mixing up the hdf5/h5py versions or setting it in the incorrect place, etc.

Some useful links:
[1] Parallel I/O with HDF5 and Performance Tuning Techniques
[2] Parallel compression improvements in HDF5 1.13.1

@team-gpu
Contributor

So the wait is only happening with chunked writes, and not if chunking isn't used?

@yousefmoazzam
Collaborator Author

@team-gpu That's what my experience was when I came across the issue: writing hdf5 unchunked (serial or parallel) -> waiting doesn't occur, writing hdf5 chunked in parallel -> waiting occurs

@ptim0626
Collaborator

OK, my bad, I set it in the wrong place... so:

Preliminary profiling results to compare:

without setting anything (the default in h5py):
[screenshot: fill_time_alloc_annotated]

with H5D_FILL_TIME_NEVER:
[screenshot: fill_time_never_annotated]

We need to find a nicer way to override the default setting; for my own testing purposes I just changed it in the source code :)

h5py seems to hard code this, even if you provide your own dataset creation property list (dcpl) via create_dataset.

@yousefmoazzam @team-gpu the above links should give a much better explanation than me about why it is only on chunks.

@yousefmoazzam
Collaborator Author

It would be good to know why that line in h5py has the comment "prevent resize glitch" (shame there's no issue number in the code comment or something):

    if chunks is not None:
        plist.set_chunk(chunks)
        plist.set_fill_time(h5d.FILL_TIME_ALLOC)  # prevent resize glitch

I.e., what is this "resize glitch", and in what cases can it happen?

Without knowing anything else apart from:

  • the fact that h5d.FILL_TIME_ALLOC seemingly has been hardcoded in h5py
  • and that code comment

it makes me wonder if bad things can happen if the value is changed to something else.

@ptim0626
Collaborator

ptim0626 commented Jul 18, 2024

This is a good point to consider.

My guess is that an hdf5 dataset can have a maximum shape specified (something like maxshape?), so filling the dataset when it is initialised ensures all values in the dataset are properly defined. Let's say the maximum shape is (10,), and initially I write only 5 values to the dataset, then later I write 3 more (a total of 8 now). Internally hdf5 might do some resize operation, and when we read the data we can in principle read all 10 values. With the 'never fill' option, reading values that were never written is undefined behaviour; with the filling option, we ensure every value is a proper value, even for memory that is never written into.

That's my guess only. h5py tends to be very general and conservative, and the 'never fill' option looks like trading safety for speed. I will certainly do some more reading to make sure this is safe to do in httomo's context (for example, I don't think a dataset is ever resized in httomo once it is created).

EDIT: I just ran git blame on it; the line was added 13 years ago!

@team-gpu
Contributor

Here is a nice description of what those flags do: https://docs.hdfgroup.org/hdf5/v1_14/_h5_d__u_g.html#subsubsec_dataset_allocation_store

So @ptim0626's understanding makes sense - the fill time setting determines what to do with the elements that have never been written. By default, HDF5 fills the content of a dataset with a fill value (zero, for example) as soon as it's created (the allocation flag controls when - incremental would mean as soon as a chunk is created). This makes sure that there is no uninitialised data in the file.

Re resizing - only chunked datasets are resizable in HDF5. Otherwise the full dataset's space is allocated as soon as it is created - and filled with the fill value by default. This could be the reason why unchunked HDF5 appears slower. I would guess that it's much faster with the fill time never setting...

For the MPI driver I couldn't find docs, so I'm trying an educated guess. Dynamic allocation of chunks can be quite the bottleneck in a parallel mode - imagine 100 processes working on the same file and allocating chunks all over the dataset. Now I don't know how chunk allocation is managed in HDF5, but there would certainly be some synchronisation needed as all processes work on the same file. It could very well be that the MPI driver does some allocation up-front in any case, to avoid too much synchronisation needed during larger writes, or something like that.

Re httomo - we never resize the dataset, so that comment, @yousefmoazzam, doesn't affect us (whatever it means). We are also reasonably sure (from tests) that we eventually write all values in the file. From that, I'd guess that never filling is a good choice.

@team-gpu
Contributor

Also note here that this comment was added 13 years ago and never touched since...
[screenshot: git blame of that line in h5py]

@team-gpu
Contributor

team-gpu commented Jul 18, 2024

It can be controlled on a lower level in the h5py library... https://groups.google.com/g/h5py/c/bivqPsEmg1w/m/QWi6mtfIAwAJ

which is also what they are doing in their own tests: https://github.com/h5py/h5py/blob/44ba349e5ecd21ecbc75b1314063e96d30aec872/h5py/tests/test_dataset_getitem.py#L589
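Along the lines of those links, a hedged sketch of the low-level route (the file name, dataset shape and chunk shape are placeholders; this mirrors the pattern in the linked h5py test rather than httomo's code):

    import numpy as np
    import h5py
    from h5py import h5d, h5p, h5s, h5t

    shape = (180, 2160, 2560)    # placeholder dataset shape
    chunks = (1, 2160, 2560)     # one chunk per projection (assumed)

    with h5py.File("intermediate.h5", "w") as f:
        dcpl = h5p.create(h5p.DATASET_CREATE)
        dcpl.set_chunk(chunks)
        dcpl.set_fill_time(h5d.FILL_TIME_NEVER)   # skip the fill-on-allocation pass

        space = h5s.create_simple(shape)
        tid = h5t.py_create(np.dtype("float32"), logical=True)
        dsid = h5d.create(f.id, b"data", tid, space, dcpl=dcpl)
        dset = h5py.Dataset(dsid)                 # wrap back into the high-level API

        dset[0] = np.zeros(shape[1:], dtype=np.float32)  # normal writes still work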

@ptim0626
Collaborator

@team-gpu thanks for all the information!

We can use the low-level interface all the way to create the dataset, but we will miss the validation, tests, etc. provided by the high-level create_dataset. The create_dataset method actually allows you to pass in your own dataset creation property list (dcpl), but in the filter's fill_dcpl function it just gets overwritten anyway...

Either we go for the 'low-level all the way' solution and ensure sufficient validation is done, or we patch (monkey-patch?) the fill_dcpl function (this would amount to copying the function from the source and just switching to the 'never fill' flag, but we would miss any upstream updates).

@team-gpu
Contributor

Perhaps raise it as an issue with them and monkey-patch in the meantime? It should be possible to configure this rather than having it silently overridden.

@team-gpu
Contributor

Another option may be this:

  1. create a dcpl object manually, and set the chunk and fill properties on that
  2. pass this to the create_new_dataset function, but leave chunks = None there...

Then this calls the filter's fill_dcpl function, but because chunks=None it does not apply them here. But we've already set the properties.

I think this can work without any monkey patching or low-level functions.
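A hedged sketch of that option (assuming, as discussed above, that create_dataset forwards a user-supplied dataset creation property list via a dcpl keyword and leaves it untouched when chunks is not passed; names and shapes are placeholders):

    import numpy as np
    import h5py
    from h5py import h5d, h5p

    dcpl = h5p.create(h5p.DATASET_CREATE)
    dcpl.set_chunk((1, 2160, 2560))           # chunk shape set on the dcpl itself
    dcpl.set_fill_time(h5d.FILL_TIME_NEVER)   # never fill, as discussed above

    with h5py.File("intermediate.h5", "w") as f:
        # chunks is deliberately not passed, so fill_dcpl should not override the dcpl
        dset = f.create_dataset("data", shape=(180, 2160, 2560),
                                dtype=np.float32, dcpl=dcpl)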

@ptim0626
Collaborator

Yes, this would work! We will just miss some chunk validations, which we can easily do ourselves.

I agree h5py should expose this to users; let me raise an issue with them later. This could potentially benefit a lot of other users who do parallel writes and never resize their datasets.

@ptim0626
Collaborator

I have modified the dataset creation to use a dcpl, following @team-gpu's approach. With compression, h5py attempts to guess the chunks even when None is passed (as compression requires chunked storage), so I monkey-patch the guess_chunk function when compression is applied. @yousefmoazzam, if you could test whether this works for you, that would be great.
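For reference, a hedged sketch of what such a monkey-patch could look like (not the PR's actual code: gzip stands in for Blosc to keep the example dependency-free, and the shapes are placeholders):

    import numpy as np
    import h5py
    import h5py._hl.filters as _filters

    desired_chunks = (1, 216, 256)   # one slice per chunk (placeholder shape)

    _original_guess_chunk = _filters.guess_chunk
    _filters.guess_chunk = lambda shape, maxshape, typesize: desired_chunks
    try:
        with h5py.File("intermediate.h5", "w") as f:
            # compression forces chunked storage, which normally triggers guess_chunk
            dset = f.create_dataset("data", shape=(180, 216, 256),
                                    dtype=np.float32, compression="gzip")
            assert dset.chunks == desired_chunks
    finally:
        _filters.guess_chunk = _original_guess_chunk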

One surprising consequence is that Blosc compression is now slower than the uncompressed dataset... I have played with some Blosc settings but with no success. The FBP result is very compressible so the timing is comparable, but the others don't look very compressible and there are some delays when saving them. I will continue to investigate.

Btw, this is my first commit to httomo, so feel free to let me know of any rules I should have followed :)

@ptim0626
Collaborator

A follow-up to the issue about the slow-down observed for compression.

It looks like Blosc compression is doing its job correctly; the slow-down relates to the datasets being compressed, i.e. how compressible they are. The time taken to compress the data must be worth it, meaning the compressed data should be much smaller so that the I/O latency is reduced considerably.

For the FBP result, the compression ratio is about 250, so we can see a speed-up:
[screenshot: compressed_fbp]
compressed (FBP): ~3.7 s

[screenshot: uncompressed_fbp]
uncompressed (FBP): ~9.2 s

However, for the other intermediate steps, those datasets have only a 1.2-1.3 compression ratio, so the I/O latency isn't reduced much and we spend extra time on compression, so the overall time appears slower:
[screenshot: compressed_save_all]
compressed (stripe removal): ~9.7 s

[screenshot: uncompressed_save_all]
uncompressed (stripe removal): ~3.7 s

(this can also be observed for the other intermediate saves, as you can see from the larger gaps, although they are not at the same scale)

The FBP result is definitely worth compressing: it goes down from ~55 GiB to ~210 MiB!! But compression of the other intermediate data may not be worth it; there is neither a speed nor a storage advantage in doing it.
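For reference, a small hedged helper for checking ratios like those quoted above (the file path and dataset name are placeholders): it compares the logical size of a dataset against the bytes actually stored on disk.

    import h5py

    def compression_ratio(path: str, dataset: str) -> float:
        with h5py.File(path, "r") as f:
            dset = f[dataset]
            stored = dset.id.get_storage_size()   # bytes on disk after compression
            return dset.nbytes / stored if stored else float("inf")

    # e.g. compression_ratio("intermediate.h5", "data") would be ~250 for the FBP output above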

@team-gpu
Contributor

Thanks for that - that makes a lot of sense.

In general, since lossless compression benefits from redundancy in the data, and we have binary floating point data (where all byte states are possible), it makes sense that less processed, less smooth data gets smaller compression ratios. A fully random floating point array would likely get a compression ratio of not much more than 1.0. And the further we progress in the pipeline, the more we "smooth" the data, i.e. extract patterns and remove noise. I'm guessing the FBP result has a lot of zeros, or similar values in the neighborhood?

For the reconstruction result, it probably even makes sense to apply lossy compression - i.e. to manually convert to 8-bit integer data as we do for images, and scale the data accordingly. There's the rescale_to_int method for that purpose. That would have the biggest effect on size (a factor of 4 even when it's raw data). Storing this into hdf5 with compression will likely make it nice and small...
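A hedged sketch of that lossy route (this is not httomo's rescale_to_int implementation, just an illustration of the idea):

    import numpy as np

    def rescale_to_uint8(recon: np.ndarray) -> np.ndarray:
        """Map a float reconstruction onto 0-255 for a 4x size cut before compression."""
        lo, hi = float(np.nanmin(recon)), float(np.nanmax(recon))
        scale = 255.0 / (hi - lo) if hi > lo else 0.0
        return np.nan_to_num((recon - lo) * scale).astype(np.uint8)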

@dkazanc
Collaborator

dkazanc commented Jul 22, 2024

Thanks a lot @ptim0626 for this work and @team-gpu for the valuable comments. The FBP compression from ~55 GiB to ~210 MiB is impressive.
A few follow-up notes:

  • This is very useful, as in 90% of cases the users would need to save only the result of the reconstruction. The other intermediate data saving would normally be needed for debugging purposes only. If we keep doing the re-slice through memory, then we certainly gain here. I guess we could enable compression in our library file to complement save_result? It would be nice to have control over it from the methods' standpoint. I would also enable compression for the denoising methods that we're about to add, or the existing ones, so that we have something like:
    median_filter:
      pattern: all
      output_dims_change: False
      implementation: gpu_cupy
      save_result_default: False
      compression: True
  • If the mask is applied in the reconstruction step then there will be a lot of zeros, which helps with the compression as well. That is something users do often.
  • The denoising methods will be applied routinely to the result of FBP, so we need to save their results compressed. The iterative methods will be even more compressible, as some of them have a very small dynamic range.
  • Saving projection data uncompressed is totally fine; as @team-gpu mentioned, this kind of data wouldn't gain from it.

Out of curiosity, if the compressed reconstruction is loaded in Dawn, how much time does it take to slice through it? There was a related issue about the chunk size needing to be small; I guess we need to sort that out first before trying?

Thanks!

@dkazanc
Collaborator

dkazanc commented Jul 22, 2024

Sorry @ptim0626, I have probably missed something, but how did we go from ~2 minutes of saving time for the result of the reconstruction to a few seconds? :) You might have seen this weird behaviour; it was also reported by @yousefmoazzam above.

[screenshot: save_intermediate_20_40_80gb]

@ptim0626
Collaborator

@dkazanc I actually haven't tried opening any FBP results (I just use h5dump/h5ls to check that the file is OK), and it turns out all my data is NaN, uncompressed or compressed. I was using the pipeline above and commented out everything after median_filter. Am I doing something wrong? I have been using the 119647 dataset for all my tests so far.

The chunk cache is for reading, and it depends on the access pattern. I think it makes sense to have a chunk cache at least the size of a single uncompressed chunk, so that looking at values inside a chunk is quick. I remember the maximum chunk size is ~4 GiB, so as long as the chunk isn't too big it shouldn't be a problem (the chunk cache should be adjusted accordingly if the chunk is bigger than 1 MiB).
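A hedged illustration of sizing the chunk cache when reading the result back (the file and dataset names and the chunk size are placeholders; rdcc_nbytes/rdcc_nslots are standard h5py File options):

    import h5py

    chunk_nbytes = 2160 * 2560 * 4                # one uncompressed float32 slice (assumed)
    with h5py.File("intermediate.h5", "r",
                   rdcc_nbytes=4 * chunk_nbytes,  # room for a few uncompressed chunks
                   rdcc_nslots=100_003) as f:     # a prime number of slots, per h5py guidance
        first_slice = f["data"][0]                # reads and decompresses a single chunk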

@dkazanc
Collaborator

dkazanc commented Jul 23, 2024

Thanks @ptim0626, it looks like this branch has not been updated from main for quite some time! I'm trying to resolve the conflicts to bring in the latest changes, and then I'll give it a go. Cheers.

httomo/methods.py (review comment; outdated, resolved)
@team-gpu
Contributor

team-gpu commented Aug 2, 2024

Hi. Adding my 2 cents for the code itself:

  • the methods got really long - for example, I'd have 2 methods for zarr and hdf5 and call them from the main one; I'd also factor out the dcpl creation into its own method, etc.
  • there are no tests for Zarr, compression, etc. - if you factor out the methods above, you can unit test each one on its own: confirm that the monkey patching works, that compression is set up, the chunk size, etc.
  • maybe a performance test would be good?

@dkazanc
Collaborator

dkazanc commented Aug 2, 2024

thanks @team-gpu. I guess I'll branch out the zarr addition for now, as it will require more work for testing etc. And yes, we probably need a few more tests here, for compression at least.

@dkazanc merged commit da8f579 into main on Aug 12, 2024
2 of 3 checks passed
@dkazanc deleted the improve-intermediate-file-performance branch on August 12, 2024 14:17