OpenMPI setting for saving compressed hdf5 chunks #389

Open · ptim0626 opened this issue Jul 16, 2024 · 1 comment
Labels: documentation (Improvements or additions to documentation)

@ptim0626 (Collaborator)

Saving Blosc-compressed chunks to an hdf5 dataset with OpenMPI results in a segmentation fault. Below is a minimal script that illustrates the issue:

import numpy as np
import h5py
import blosc
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

def prepare(ch, y, x):
    start = (ch * rank, 0, 0)
    stop = (ch * (rank+1), y, x)

    # use random data
    rng = np.random.default_rng(rank * 123)
    data = rng.random((ch, y, x), dtype=np.float32)

    return start, stop, data

def main():
    fp = "/dls/tmp/..../compressed_out.hdf5"

    # nframes should be divisible by rank size
    nframes = 1200
    y = 2160
    x = 2560
    global_shape = (nframes, y, x)
    ch = nframes // size

    # just some useful information
    if rank == 0:
        tot_sz = nframes * y * x * 4 / 2**30
        print(f"full data shape: ({nframes}, {y}, {x})")
        print(f"full data size (uncompressed): {tot_sz:.3f} Gb")
        print(f"each block size (uncompressed): {tot_sz / size:.3f} Gb")
        print(f"Blosc max buffer size: {float(blosc.BLOSC_MAX_BUFFERSIZE) / 2**30:.3f} Gb")

    start, stop, data = prepare(ch, y, x)

    with h5py.File(fp, "w", driver="mpio", comm=MPI.COMM_WORLD) as f:

        dset = f.create_dataset("/data",
                                shape=global_shape,
                                dtype=data.dtype,
                                chunks=(1, y, x),
                                compression=32001,
                                compression_opts=(0,0,0,0,5,1,1),
                                )

        comm.Barrier()

        with dset.collective:
            dset[start[0]:stop[0], start[1]:stop[1], start[2]:stop[2]] = data

        comm.Barrier()

if __name__ == "__main__":
    main()
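
(Aside: if the hdf5plugin package happens to be available, an assumption on my part, the same Blosc filter can be requested with named arguments rather than the raw compression_opts tuple; a minimal sketch of the equivalent create_dataset call inside the script:)

# sketch: equivalent dataset creation via hdf5plugin (lz4, level 5, byte shuffle),
# assuming "import hdf5plugin" has been added at the top of the script
dset = f.create_dataset(
    "/data",
    shape=global_shape,
    dtype=data.dtype,
    chunks=(1, y, x),
    **hdf5plugin.Blosc(cname="lz4", clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE),
)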

The script produces some random data on each MPI rank and saves it to the hdf5 file with Blosc compression (32001 is the registered HDF5 filter ID of Blosc; in compression_opts, 5 is the compression level, the second-to-last 1 enables byte shuffling, and the last 1 selects the lz4 compressor). The script can be run on the cluster with the following submission script:

#!/usr/bin/env bash
#SBATCH --partition=cs05r
#SBATCH --job-name=mpi-compressed
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 4
#SBATCH --cpus-per-task=1
#SBATCH --mem 100G
#SBATCH --time 30

# load your environment (openmpi as well)
# ....

# check using the expected MPI
type -p mpicc

srun python save_compressed_by_mpi.py

This results in a segmentation fault, and you will see something like:

[cs05r-sc-com04-02:2060087:0:2060087] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x55e25d396018)
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
....
==== backtrace (tid:2060087) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000003e38 shuffle_init.isra.2()  fcoll_vulcan_file_write_all.c:0
 2 0x0000000000006bdd mca_fcoll_vulcan_file_write_all()  ???:0
 3 0x00000000000097f9 mca_common_ompio_file_write_at_all()  ???:0
 4 0x0000000000005bf6 mca_io_ompio_file_write_at_all()  ???:0
 5 0x000000000007b338 PMPI_File_write_at_all()  ???:0
 6 0x000000000033321d H5FD__mpio_write()  H5FDmpio.c:0
 7 0x0000000000132cd9 H5FD_write()  ???:0
 8 0x000000000010eebd H5F__accum_write()  ???:0
 9 0x000000000021da14 H5PB_write()  ???:0
10 0x0000000000119fb1 H5F_shared_block_write()  ???:0
11 0x00000000003303f3 H5D__mpio_select_write()  ???:0
12 0x0000000000324973 H5D__final_collective_io()  H5Dmpio.c:0
13 0x000000000032f60e H5D__piece_io()  H5Dmpio.c:0
14 0x00000000003304b9 H5D__collective_write()  ???:0
15 0x00000000000dbf83 H5D__write()  ???:0
16 0x00000000002fc356 H5VL__native_dataset_write()  ???:0
17 0x00000000002e4528 H5VL__dataset_write()  H5VLcallback.c:0
...
[cs05r-sc-com04-02:2060087] *** Process received signal ***
[cs05r-sc-com04-02:2060087] Signal: Segmentation fault (11)
[cs05r-sc-com04-02:2060087] Signal code:  (-6)
[cs05r-sc-com04-02:2060087] Failing at address: 0x1291ba001f6f37
[cs05r-sc-com04-02:2060087] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f55729b1cf0]
[cs05r-sc-com04-02:2060087] [ 1] /dls_sw/apps/openmpi/4.1.5/rhel8/x86_64/lib/openmpi/mca_fcoll_vulcan.so(+0x3e38)[0x7f554023be38]
[cs05r-sc-com04-02:2060087] [ 2] /dls_sw/apps/openmpi/4.1.5/rhel8/x86_64/lib/openmpi/mca_fcoll_vulcan.so(mca_fcoll_vulcan_file_write_all+0x114d)[0x7f554023ebdd]
[cs05r-sc-com04-02:2060087] [ 3] /dls_sw/apps/openmpi/4.1.5/rhel8/x86_64/lib/libmca_common_ompio.so.41(mca_common_ompio_file_write_at_all+0x49)[0x7f554b9a37f9]
[cs05r-sc-com04-02:2060087] [ 4] /dls_sw/apps/openmpi/4.1.5/rhel8/x86_64/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_write_at_all+0x26)[0x7f554bbadbf6]
[cs05r-sc-com04-02:2060087] [ 5] /dls_sw/apps/openmpi/4.1.5/rhel8/x86_64/lib/libmpi.so.40(PMPI_File_write_at_all+0xc8)[0x7f5561c32338]
...

My initial suspicion was that the data block is larger than the maximum buffer size in Blosc (around 2 GB, see here), since using nframes = 300 (each data block being (300/4)*2160*2560*4 / 2**30 ≈ 1.5 GB) works. But upon further investigation, if you use 2 MPI processes (#SBATCH --ntasks-per-node 2), the above script works regardless of how large the data block is. Only 2, but not more! So it looks like the Blosc maximum buffer size is irrelevant (I guess h5py handles this internally, i.e. divides the data block up by the maximum buffer size when compressing).
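
To make the sizes above concrete, here is a small check of the per-rank block size for the three cases discussed (a sketch reusing the numbers from the script; note also that the chunks are (1, y, x), i.e. roughly 21 MB each, so any single chunk is far below the Blosc limit regardless of how the block is split across ranks):

import blosc

y, x, itemsize = 2160, 2560, 4  # float32 frames, as in the script above
chunk_bytes = 1 * y * x * itemsize  # chunk shape is (1, y, x)
print(f"one chunk: {chunk_bytes / 2**20:.1f} MB")

# per-rank block size for (nframes, number of ranks): fails, works, works
for nframes, nranks in [(1200, 4), (300, 4), (1200, 2)]:
    block_bytes = (nframes // nranks) * y * x * itemsize
    over = "over" if block_bytes > blosc.BLOSC_MAX_BUFFERSIZE else "under"
    print(f"nframes={nframes}, ranks={nranks}: "
          f"{block_bytes / 2**30:.2f} GB per rank ({over} the Blosc max buffer)")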

Looking at the error log, it suggests something nasty is happening in calls involving fcoll_vulcan. From the OpenMPI docs, vulcan is a component of the fcoll framework, which handles collective file I/O. Changing the component to dynamic makes the segmentation fault disappear:

...
export OMPI_MCA_fcoll=dynamic

srun python save_compressed_by_mpi.py

This works regardless of the number of MPI processes. Using dynamic_gen2 instead seems to be slower, at least for me, partly because OpenMPI 4.1.x is being used.
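
For what it is worth, the fcoll components actually built into an OpenMPI installation can be listed with ompi_info, and the same MCA parameter can also be passed on the command line when launching with mpirun rather than srun (a sketch; the component list will depend on the installation):

# list the fcoll components available in this OpenMPI build
ompi_info | grep fcoll

# equivalent to OMPI_MCA_fcoll=dynamic when launching with mpirun directly
mpirun --mca fcoll dynamic -np 4 python save_compressed_by_mpi.py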

As suggested in this issue, using export OMPI_MCA_io=^ompio also works (the ^ excludes the ompio component, so MPI I/O falls back to ROMIO instead).

I suggest that when a centralised httomo environment is built, we use export OMPI_MCA_fcoll=dynamic as a workaround for now. According to the issue above, this problem might be resolved in OpenMPI 5.x, so the workaround may only be relevant for 4.x.
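
If the centralised environment ends up being conda-based (an assumption on my part), the workaround could be baked into an activation script so users pick it up automatically, along the lines of:

# sketch: run once inside the activated environment; conda sources any
# scripts placed in etc/conda/activate.d/ on every activation
mkdir -p "$CONDA_PREFIX/etc/conda/activate.d"
echo 'export OMPI_MCA_fcoll=dynamic' > "$CONDA_PREFIX/etc/conda/activate.d/ompi-fcoll-workaround.sh"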

@ptim0626 (Collaborator, Author)

This issue could be related.
