Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a critical race condition in distributed file processing where file size validation could fail due to premature checks by worker ranks. By centralizing the validation logic on the master rank and enforcing a global synchronization barrier, the changes ensure deterministic and accurate file integrity checks, preventing erroneous exits and improving the reliability of the simulation workflow.

Highlights
Changelog
Activity
Code Review
This pull request correctly addresses a potential race condition in file validation within a distributed MPI environment. By moving the validation logic to the master rank and executing it only after all worker ranks have synchronized at a barrier, the change ensures that the file size is checked deterministically after all I/O operations are complete. The removal of the previous validate_end function, a redundant print statement, and the addition of flush=True to print calls are all good improvements. I've added one suggestion to further improve the robustness of the new validation logic by handling potential OSError exceptions, which will prevent the master rank from crashing and ensure a clean shutdown of all processes in case of unexpected file system issues.
```python
if is_master:
    logger.debug("Simulation completed.")
    actual_size = os.stat(args.out_file).st_size
    if actual_size != file_size:
        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)
    else:
        logger.debug("Simulation completed and size verified.")
        print("✅ HF completed successfully", flush=True)
```
For increased robustness, it's good practice to wrap the file size check in a try...except OSError block. This will gracefully handle cases where os.stat(args.out_file) might fail (e.g., if the file is unexpectedly deleted or permissions change during execution). This prevents the master rank from crashing with an unhandled exception and ensures a clean shutdown of all MPI processes via comm.Abort().
Suggested change:

```diff
 if is_master:
-    logger.debug("Simulation completed.")
-    actual_size = os.stat(args.out_file).st_size
-    if actual_size != file_size:
-        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
-        print(msg, flush=True)
-        logger.error(msg)
-        comm.Abort(1)
-    else:
-        logger.debug("Simulation completed and size verified.")
-        print("✅ HF completed successfully", flush=True)
+    try:
+        actual_size = os.stat(args.out_file).st_size
+        if actual_size != file_size:
+            msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
+            print(msg, flush=True)
+            logger.error(msg)
+            comm.Abort(1)
+        else:
+            logger.debug("Simulation completed and size verified.")
+            print("✅ HF completed successfully", flush=True)
+    except OSError as e:
+        msg = f"CRITICAL: Could not stat output file {args.out_file}: {e}"
+        print(msg, flush=True)
+        logger.error(msg)
+        comm.Abort(1)
```
Pull request overview
This PR fixes a race condition in MPI HF simulations where file-size validation could run before all ranks finished writing, by moving final validation to the master rank after an MPI barrier.
Changes:
- Removed the per-rank `validate_end()` logic that could trigger early while other ranks were still writing.
- Performed final output file size verification on the master rank strictly after `comm.Barrier()`.
- Adjusted stdout printing to flush immediately for more reliable Slurm log output.
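The barrier-then-validate pattern described in these changes can be sketched as follows. This is a minimal illustration, not the PR's actual code: `validate_output` is a hypothetical helper, the `comm`, `is_master`, `out_file`, and `file_size` names are borrowed from the diff, and any mpi4py communicator (or a stub with the same `Barrier`/`Abort` methods) can be passed in:

```python
import os


def validate_output(comm, is_master, out_file, expected_size):
    """Barrier first, then the master rank alone checks the final file size.

    Returns True when the size matches (or on non-master ranks); on the
    master, a mismatch or stat failure calls comm.Abort(1) to tear down
    every rank instead of leaving survivors hung at a barrier.
    """
    comm.Barrier()  # no rank proceeds until every rank has finished its I/O
    if not is_master:
        return True
    try:
        actual = os.stat(out_file).st_size
    except OSError as exc:  # file deleted, permissions changed, etc.
        print(f"CRITICAL: could not stat {out_file}: {exc}", flush=True)
        comm.Abort(1)
        return False
    if actual != expected_size:
        print(f"CRITICAL: size mismatch! expected {expected_size}, got {actual}",
              flush=True)
        comm.Abort(1)
        return False
    return True
```

Note that a real `MPI.Comm.Abort` never returns; the `return False` lines only matter when the helper is exercised with a test stub.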
```python
# distribute work, must be sequential for optimisation,
# and for validation function above to be thread safe
```
The comment about needing sequential distribution "for validation function above to be thread safe" is now stale since validate_end() was removed and validation happens on the master after the barrier. Please update/remove this comment to reflect the new synchronization/validation approach (and avoid misleading future maintainers).
Suggested change:

```diff
-# distribute work, must be sequential for optimisation,
-# and for validation function above to be thread safe
+# distribute work in a round-robin fashion across ranks for optimisation
```
The issue

The old code assumed that the rank processing the numerically highest station index (`work_idx[-1] == stations_todo_idx[-1]`) would inherently be the last rank to finish writing to the file. In a multi-node environment, that is not guaranteed: Rank 35 might be assigned the very last station and finish it in 10 seconds, while Rank 2 might be assigned a complex station early in the list and take 15 seconds to finish.

Under the old logic, Rank 35 would finish, trigger `validate_end()`, and potentially pass or fail the size check while Rank 2 was still actively writing data to the middle of the file. Furthermore, if a rank triggered a validation error and exited silently without properly signaling the entire communicator, the surviving ranks would wait at the MPI barrier indefinitely until the Slurm allocation ran out, a resource-wasting "zombie" deadlock burning core hours until it hit walltime.
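The mismatch between "holds the last station index" and "finishes writing last" can be demonstrated without MPI. This toy model uses hypothetical per-station costs and the round-robin assignment style the PR adopts; all names and numbers are illustrative:

```python
def round_robin(stations, n_ranks):
    """Assign station indices to ranks in round-robin order."""
    return {rank: stations[rank::n_ranks] for rank in range(n_ranks)}


def finish_times(assignment, cost):
    """Each rank processes its stations sequentially; return total time per rank."""
    return {rank: sum(cost[s] for s in idx) for rank, idx in assignment.items()}


# Hypothetical costs: station 1 is expensive, the last station (5) is cheap.
cost = {0: 1, 1: 15, 2: 1, 3: 1, 4: 1, 5: 1}

assignment = round_robin(list(range(6)), 3)  # rank 2 gets stations [2, 5]
times = finish_times(assignment, cost)
# times == {0: 2, 1: 16, 2: 2}: rank 2 owns the last station index yet
# finishes long before rank 1, so "owner of the last index" != "last writer",
# and validation keyed off the last index can fire while writes are ongoing.
```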
Solution

- Removed the `validate_end()` function from the worker loop.
- Added `comm.Barrier()` to ensure all ranks have completely finished their calculations and I/O.
- Run final validation only on the master rank (`is_master`), strictly after the barrier is lifted.
- Added `flush=True` to standard output prints to ensure timely log delivery in Slurm outfiles.
- Used `comm.Abort(1)` for critical file size mismatches. IMPORTANT: when paired with the `srun --quit-on-interrupt --kill-on-bad-exit=1` flags in the submission wrappers, this guarantees that a validation failure acts as a cluster-wide kill switch, instantly terminating all tasks across all nodes and preventing hung jobs.

This ensures deterministic file validation, prevents premature exits, and guarantees clean job failures without wasting compute allocation.
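The kill-switch pairing mentioned above might look like this in a submission wrapper. This is an illustrative sbatch fragment, not the project's actual script: the script name, resource counts, and `run_hf.py` entry point are placeholders.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=18

# --kill-on-bad-exit=1: srun terminates every task as soon as any task exits
# non-zero, so comm.Abort(1) on the master tears down the whole job instead
# of leaving surviving ranks stuck at the MPI barrier until walltime.
# --quit-on-interrupt: a single SIGINT terminates the step immediately.
srun --quit-on-interrupt --kill-on-bad-exit=1 python run_hf.py "$@"
```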