diff --git a/docs/source/usage/plugins/openPMD.rst b/docs/source/usage/plugins/openPMD.rst
index 15e1deb43b..0e1617c549 100644
--- a/docs/source/usage/plugins/openPMD.rst
+++ b/docs/source/usage/plugins/openPMD.rst
@@ -113,7 +113,7 @@ PIConGPU command line option description
 ===================================== ====================================================================================================================================================
 ``--openPMD.period``                  Period after which simulation data should be stored on disk.
 ``--openPMD.source``                  Select data sources and filters to dump. Default is ``species_all,fields_all``, which dumps all fields and particle species.
-``--openPMD.file``                    Relative or absolute openPMD file prefix for simulation data. If relative, files are stored under ``simOutput``.
+``--openPMD.file``                    Relative or absolute openPMD file prefix for simulation data. If relative, files are stored under ``simOutput``.
 ``--openPMD.ext``                     openPMD filename extension (this controls the backend picked by the openPMD API).
 ``--openPMD.infix``                   openPMD filename infix (use to pick file- or group-based layout in openPMD). Set to NULL to keep empty (e.g. to pick group-based iteration layout).
 ``--openPMD.json``                    Set backend-specific parameters for openPMD backends in JSON format.
@@ -122,14 +122,14 @@ PIConGPU command line option description

 .. note::

-   This plugin is a multi plugin.
+   This plugin is a multi plugin.
    Command line parameters can be used multiple times to create e.g. dumps with different dumping periods.
    In the case where an optional parameter with a default value is explicitly defined, the parameter will always be passed to the instance of the multi plugin where the parameter is not set.
    e.g.

    .. code-block:: bash

-      --openPMD.period 128 --openPMD.file simData1 --openPMD.source 'species_all'
+      --openPMD.period 128 --openPMD.file simData1 --openPMD.source 'species_all'
       --openPMD.period 1000 --openPMD.file simData2 --openPMD.source 'fields_all' --openPMD.ext h5

    creates two plugins:
@@ -137,6 +137,36 @@ PIConGPU command line option description
 #. dump all species data each 128th time step, use HDF5 backend.
 #. dump all field data each 1000th time step, use the default ADIOS backend.

+Backend-specific notes
+^^^^^^^^^^^^^^^^^^^^^^
+
+HDF5
+====
+
+In order to avoid a performance bug for parallel HDF5 on the ORNL Summit compute system, a specific version of ROMIO should be chosen and performance hints should be passed:
+
+.. code-block:: bash
+
+   > export OMPI_MCA_io=romio321
+   > export ROMIO_HINTS=./my_romio_hints
+   > cat << EOF > ./my_romio_hints
+   romio_cb_write enable
+   romio_ds_write enable
+   cb_buffer_size 16777216
+   cb_nodes <number_of_nodes>
+   EOF
+
+Replace ``<number_of_nodes>`` with the number of nodes that your job uses.
+These settings are applied automatically in the Summit templates found in ``etc/picongpu/summit-ornl``.
+For further information, see the `official Summit documentation `_ and `this pull request for WarpX <https://github.com/ECP-WarpX/WarpX/pull/2495>`_.
+
+
+Performance
+^^^^^^^^^^^
+
+On the Summit compute system, specifying ``export IBM_largeblock_io=true`` disables data shipping, which leads to reduced overhead for large block write operations.
+This setting is applied in the Summit templates found in ``etc/picongpu/summit-ornl``.
+
 Memory Complexity
 ^^^^^^^^^^^^^^^^^

@@ -192,4 +222,4 @@ Notes on the implementation of a proper template file:

 * Most batch systems will forward all resource allocations of a batch script to launched parallel processes inside the batch script.
   When launching several processes asynchronously, resources must be allocated explicitly. This includes GPUs, CPU cores and often memory.
-* This setup is currently impossible to implement on the HZDR Hemera cluster due to a wrong configuration of the Batch system.
\ No newline at end of file
+* This setup is currently impossible to implement on the HZDR Hemera cluster due to a wrong configuration of the Batch system.
diff --git a/etc/picongpu/summit-ornl/gpu_batch.tpl b/etc/picongpu/summit-ornl/gpu_batch.tpl
index 8ac5557316..b118269b5d 100644
--- a/etc/picongpu/summit-ornl/gpu_batch.tpl
+++ b/etc/picongpu/summit-ornl/gpu_batch.tpl
@@ -88,6 +88,22 @@ export OMPI_MCA_coll_ibm_skip_barrier=true

 #jsrun -N 1 -n !TBG_nodes !TBG_dstPath/input/bin/cuda_memtest.sh

+# I/O tuning inspired by WarpX, see https://github.com/ECP-WarpX/WarpX/pull/2495
+# ROMIO has a hint for GPFS named IBM_largeblock_io which optimizes I/O with operations on large blocks
+export IBM_largeblock_io=true
+
+# MPI-I/O: ROMIO hints for parallel HDF5 performance
+export OMPI_MCA_io=romio321
+export ROMIO_HINTS=./romio-hints
+# number of hosts: unique node names minus batch node
+NUM_HOSTS=$(( $(echo $LSB_HOSTS | tr ' ' '\n' | uniq | wc -l) - 1 ))
+cat > romio-hints << EOL
+romio_cb_write enable
+romio_ds_write enable
+cb_buffer_size 16777216
+cb_nodes ${NUM_HOSTS}
+EOL
+
 #if [ $? -eq 0 ] ; then
   export OMP_NUM_THREADS=!TBG_coresPerGPU
   jsrun --nrs !TBG_tasks --tasks_per_rs 1 --cpu_per_rs !TBG_coresPerGPU --gpu_per_rs 1 --latency_priority GPU-CPU --bind rs --smpiargs="-gpu" !TBG_dstPath/input/bin/picongpu --mpiDirect !TBG_author !TBG_programParams | tee output
diff --git a/etc/picongpu/summit-ornl/gpu_batch_pipe.tpl b/etc/picongpu/summit-ornl/gpu_batch_pipe.tpl
index 4706c8b7a5..9a757ee8e4 100644
--- a/etc/picongpu/summit-ornl/gpu_batch_pipe.tpl
+++ b/etc/picongpu/summit-ornl/gpu_batch_pipe.tpl
@@ -110,6 +110,22 @@ export OMP_NUM_THREADS=!TBG_coresPerGPU
 # strategies that keep communication in one node.
 export OPENPMD_CHUNK_DISTRIBUTION=hostname_binpacking_binpacking

+# I/O tuning inspired by WarpX, see https://github.com/ECP-WarpX/WarpX/pull/2495
+# ROMIO has a hint for GPFS named IBM_largeblock_io which optimizes I/O with operations on large blocks
+export IBM_largeblock_io=true
+
+# MPI-I/O: ROMIO hints for parallel HDF5 performance
+export OMPI_MCA_io=romio321
+export ROMIO_HINTS=./romio-hints
+# number of hosts: unique node names minus batch node
+NUM_HOSTS=$(( $(echo $LSB_HOSTS | tr ' ' '\n' | uniq | wc -l) - 1 ))
+cat > romio-hints << EOL
+romio_cb_write enable
+romio_ds_write enable
+cb_buffer_size 16777216
+cb_nodes ${NUM_HOSTS}
+EOL
+
 # export LD_PROFILE_OUTPUT=`pwd`
 # export LD_PROFILE=libadios2_evpath.so
 jsrun --nrs !TBG_tasks --tasks_per_rs 1 --cpu_per_rs !TBG_coresPerGPU --gpu_per_rs 1 --latency_priority GPU-CPU --bind rs --smpiargs="-gpu" !TBG_dstPath/input/bin/picongpu !TBG_author !TBG_programParams | tee ../output &
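
A quick way to sanity-check the ``cb_nodes`` value that both templates compute is to run the same host-counting pipeline on a sample ``LSB_HOSTS`` value. The snippet below is only a minimal sketch: the hostnames and slot layout are made up for illustration, assuming ``LSB_HOSTS`` lists the batch node plus one entry per allocated slot on each compute node, as the template comments suggest.

.. code-block:: bash

   # Hypothetical LSB_HOSTS value: one batch node ("batch1") plus two compute
   # nodes ("h41n01", "h41n02"), each repeated once per allocated slot.
   LSB_HOSTS="batch1 h41n01 h41n01 h41n01 h41n02 h41n02 h41n02"

   # Same pipeline as in the templates: one host per line, collapse adjacent
   # duplicates, count the unique names, and subtract the batch node.
   NUM_HOSTS=$(( $(echo $LSB_HOSTS | tr ' ' '\n' | uniq | wc -l) - 1 ))

   echo "cb_nodes ${NUM_HOSTS}"   # prints: cb_nodes 2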