MPI collective I/O and UnifyFS #781
I happen to have this information since my current paper talks about the MPI consistency model. The MPI standard provides three levels of consistency:
- sequential consistency among all accesses using a single file handle;
- sequential consistency among all accesses using file handles created from a single collective open with atomic mode enabled;
- user-imposed consistency among accesses other than the above.
So here we should only be worrying about the third case. In this case, MPI requires a sync-barrier-sync construct between the conflicting writes (from different processes). The construct can be one of the following:
- MPI_File_sync() by the first writer, a barrier (or any equivalent synchronization), then MPI_File_sync() by the second writer; or
- closing and reopening the file, since MPI_File_close() and MPI_File_open() count as the first and second sync, respectively.
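For reference, a minimal sketch of the sync-barrier-sync construct in C, with two ranks writing conflicting data to the same offset (the file name, offsets, and values are illustrative only):

```c
#include <mpi.h>

/* Sketch: two ranks write conflicting data to the same offset of a shared
 * file; the sync-barrier-sync separates the writes so that MPI guarantees
 * the second write is the one that survives. */
int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    if (rank == 0) {
        value = 1;                        /* first (conflicting) write */
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, &status);
    }

    /* sync-barrier-sync: required by the MPI consistency model before a
     * different process writes (or reads) the same bytes */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    if (rank == 1) {
        value = 2;                        /* second write to the same offset */
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, &status);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```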
Thanks @wangvsa. So then the app should have the sync-barrier-sync for the first situation above (1) - two different procs, but it's not required in (2) - same proc. I'm guessing most apps don't have it in either case, and UnifyFS might actually need it for both to work properly.
The apps themselves rarely overwrite the same offset (they rarely perform two collective calls on the same range).
Right, hopefully it's not too common, and based on your earlier research we have some confidence in that. A few of the PnetCDF tests I've been running do encounter this kind of condition.

The fill call here conflicts with the put (write) calls later in the program. The test case reports data corruption under UnifyFS, because on read back, it finds the fill value rather than the expected data. When running with 2 processes, one process writes the fill data and the other writes the actual data. The fill call doesn't specify any kind of offset, so in this case, we could argue the PnetCDF user probably should call an explicit sync before the later writes. However, this got me thinking about potential problems with MPI collective I/O more generally.

Edit: Actually, on closer inspection, only rank 0 issues put (write) calls in this particular test case. I think the actual problem is that ranks try to read from the file before any earlier writes have been sync'd. The file should have been closed or sync'd before trying to read back data, I think even by PnetCDF semantics. So perhaps this test case is not really valid.
A second example from PnetCDF is a test where variables are filled and later put calls conflict with that fill operation; the test reports data corruption when using 2 ranks. A workaround is to call a sync between the fill and the later puts.
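A rough sketch of the fill-then-put pattern and the sync workaround being discussed, using the standard PnetCDF C API; the variable layout, sizes, and the exact placement of ncmpi_sync are illustrative assumptions, not code taken from the actual test:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Sketch of the fill-then-put pattern: fill values are written collectively
 * when the file leaves define mode, then a put writes real data over the
 * same offsets. The ncmpi_sync/MPI_Barrier/ncmpi_sync in between is the
 * workaround under discussion; names and sizes are illustrative. */
int main(int argc, char **argv)
{
    int ncid, dimid, varid, rank, old_fill;
    MPI_Offset start[1], count[1];
    int data[10];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    ncmpi_create(MPI_COMM_WORLD, "testfile.nc", NC_CLOBBER,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", 10, &dimid);
    ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid);
    ncmpi_set_fill(ncid, NC_FILL, &old_fill);  /* request fill values */
    ncmpi_enddef(ncid);                        /* fill values written here */

    /* workaround: force the fill writes to be visible before the puts */
    ncmpi_sync(ncid);
    MPI_Barrier(MPI_COMM_WORLD);
    ncmpi_sync(ncid);

    /* collective put: rank 0 supplies the real data, other ranks write
     * a zero-length region so the collective call still matches up */
    for (int i = 0; i < 10; i++) data[i] = i;
    start[0] = 0;
    count[0] = (rank == 0) ? 10 : 0;
    ncmpi_put_vara_int_all(ncid, varid, start, count, data);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}
```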
While I'm at it, here are two other test cases I've found so far:
- fill calls conflict with later puts
- implicit fill during enddef and a later explicit fill call conflict with later put calls
According to the PnetCDF documentation, "PnetCDF follows the same parallel I/O data consistency as MPI-IO standard". If that is the case, they should either set atomic mode when opening an MPI file or insert enough sync-barrier-sync constructs. Otherwise, I would argue they have consistency issues in their implementation, not just invalid test cases.
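For completeness, the atomic-mode alternative mentioned above looks roughly like this (a small sketch; the file name is illustrative):

```c
#include <mpi.h>

/* Sketch of the "atomic mode" alternative: after a collective open,
 * enabling atomicity makes MPI guarantee sequential consistency for
 * conflicting accesses through these file handles, without an explicit
 * sync-barrier-sync. */
void open_atomic(MPI_Comm comm, MPI_File *fh)
{
    MPI_File_open(comm, "output.nc", MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, fh);
    MPI_File_set_atomicity(*fh, 1);   /* 1 = enable atomic mode */
}
```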
The default mode of PnetCDF intentionally does not call MPI_File_sync().
I did find this in the release notes for v1.2.0: https://parallel-netcdf.github.io/wiki/NewsArchive.html
This does not directly apply, but I'll just stash this URL here.
I opened a PR for a discussion with the PnetCDF team about calling a sync in these cases.
@adammoody I'm trying to reproduce these conflicts. Which system and MPI implementation were you using?
I did most of the work on quartz, which uses MVAPICH2 as its system MPI library. Actually, I was using a debug build of MVAPICH so that I could trace into the MPI code. I'll send you an email with the steps for how I set things up.
I just tried ivarn and tst_def_var_fill using Open MPI and MPICH. They don't show any conflict on my side; all I/O calls are done internally using MPI_File_write_at_all (eventually only rank 0 does the pwrite()).
With the collective write calls in MPI I/O, the MPI library may rearrange data among processes to write to the underlying file more efficiently, as is done in ROMIO's collective buffering. The user does not know which process actually writes to the file, even if they know which process provides the source data and file offset to be written.
An application may be written such that a given process writes twice to the same file offset using collective write calls. Since the same process writes to the same offset, the MPI standard does not require the application to call MPI_File_sync() between those writes. However, depending on the MPI implementation, those actual writes may happen from two different processes.

As an example taken from PnetCDF, it is common to set default values for variables in a file using fill calls and then later write actual data to those variables. The fill calls use collective I/O, whereas the later write call may not. In this case, two different processes can write to the same file offset, one process with the fill value, and a second process with the actual data. In UnifyFS, these two writes need to be separated with a sync-barrier-sync to establish an order between them.
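A rough MPI-level sketch of that scenario (assumed for illustration, not extracted from PnetCDF): a collective write lays down fill values, and a later independent write from one rank overwrites the same offset, with the sync-barrier-sync establishing the order UnifyFS needs.

```c
#include <mpi.h>

/* Rough MPI-level sketch of the fill-then-overwrite pattern. The collective
 * write may be carried out by a different process than the one that supplied
 * the data (e.g., ROMIO collective buffering), so the two writes to offset 0
 * can end up coming from different processes. */
int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    int rank, fill = -1, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "data.out",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* collective "fill": every rank writes one fill value at its own offset */
    MPI_File_write_at_all(fh, (MPI_Offset)(rank * sizeof(int)),
                          &fill, 1, MPI_INT, &status);

    /* sync-barrier-sync: orders the fill writes before the overwrite below */
    MPI_File_sync(fh);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_File_sync(fh);

    if (rank == 0) {
        /* independent write of the real data over the fill value at offset 0 */
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, &status);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```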
It may be necessary to ask users to do at least one of the following:
- UNIFYFS_CLIENT_WRITE_SYNC=1 if using collective write calls (one might still need a barrier after all syncs)
- MPI_File_sync() + MPI_Barrier() after any collective write call (see the sketch below)

Need to review the MPI standard:
- MPI_File_sync in the case that the application knowingly writes to the same file offset from two different ranks using two collective write calls: is MPI_File_sync needed in between or not?
- MPI_File_sync is not required when the same process writes to the same offset in two different write calls.

Regardless, I suspect very few applications currently call MPI_File_sync in either situation. Even if the standard requires it, we need to call this out.

The UnifyFS-enabled ROMIO could sync extents and then call barrier on its collective write calls. This would ensure all writes are visible upon returning from the collective write.
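As a sketch of the second option in the list above (not an existing UnifyFS or ROMIO interface, just a hypothetical wrapper an application could adopt):

```c
#include <mpi.h>

/* Hypothetical helper: a collective write followed immediately by
 * sync + barrier, so that this process's extents are flushed and all
 * processes have passed the write before the call returns. Per the MPI
 * standard's sync-barrier-sync construct, a reader on another process
 * would still call MPI_File_sync() before reading the data back. */
static int write_at_all_synced(MPI_File fh, MPI_Offset offset, const void *buf,
                               int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Status status;
    int rc = MPI_File_write_at_all(fh, offset, buf, count, type, &status);
    if (rc != MPI_SUCCESS) return rc;

    rc = MPI_File_sync(fh);       /* flush/register this process's writes */
    if (rc != MPI_SUCCESS) return rc;

    return MPI_Barrier(comm);     /* everyone has completed the sync */
}
```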