Let master handle validate_end() #539

Open
sungeunbae wants to merge 2 commits into master from hf_sim_fix

Conversation

sungeunbae (Member) commented Mar 1, 2026

The ISSUE
The old code assumed that the rank processing the numerically highest station index (work_idx[-1] == stations_todo_idx[-1]) would inherently be the last rank to finish writing to the file.

In a multi-node environment, that is not guaranteed. Rank 35 might be assigned the very last station and finish it in 10 seconds. Meanwhile, Rank 2 might be assigned a complex station early in the list and take 15 seconds to finish.
Under the old logic, Rank 35 would finish, trigger validate_end(), and potentially pass or fail the size check while Rank 2 is still actively writing data to the middle of the file.

Furthermore, if a rank hit a validation error and exited silently without signaling the whole communicator, the surviving ranks would wait at the MPI barrier indefinitely, burning core hours in a "zombie" deadlock until the Slurm allocation hit its walltime.
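The mismatch can be shown with a tiny stand-alone sketch (hypothetical finish times taken from the example above; no MPI needed):

```python
# Hypothetical sketch: per-rank wall times (seconds) from the example above.
# Rank 35 holds the numerically last station index, but rank 2 takes longer.
finish_time = {2: 15.0, 35: 10.0}   # rank -> time its writes complete
rank_with_last_station = 35         # old code: this rank ran validate_end()

validation_runs_at = finish_time[rank_with_last_station]
last_write_ends_at = max(finish_time.values())
print(validation_runs_at < last_write_ends_at)  # True: file checked too early
```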

Solution

  1. Removed the isolated validate_end() function from the worker loop.
  2. Relied on the existing comm.Barrier() to ensure all ranks have completely finished their calculations and I/O.
  3. Delegated the final file size validation to the Master Rank (is_master) strictly after the barrier is lifted.
  4. Added flush=True to standard output prints to ensure timely log delivery in Slurm .out files.
  5. Implemented comm.Abort(1) for critical file size mismatches. IMPORTANT: when paired with the srun --quit-on-interrupt --kill-on-bad-exit=1 flags in the submission wrappers, this guarantees that a validation failure acts as a cluster-wide kill switch, instantly terminating all tasks across all nodes and preventing hung jobs.
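As a sketch, a submission wrapper pairing those flags with comm.Abort(1) might look like the following (the sbatch resources, script path, and arguments are illustrative assumptions; only the two srun flags come from this change):

```shell
#!/bin/bash
#SBATCH --nodes=4              # illustrative resource request, not from the PR
#SBATCH --ntasks-per-node=16

# --kill-on-bad-exit=1: if any task exits non-zero (e.g. after comm.Abort(1)),
#   srun terminates every task on every node instead of waiting for walltime.
# --quit-on-interrupt: a single SIGINT quits immediately rather than prompting.
srun --quit-on-interrupt --kill-on-bad-exit=1 \
    python workflow/calculation/hf_sim.py "$@"
```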

This ensures deterministic file validation, prevents premature exits, and guarantees clean job failures without wasting compute allocation.
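Assuming an mpi4py-style interface (as the comm.Barrier()/comm.Abort(1) calls suggest), the post-barrier flow reduces to the pattern below. The sketch uses a stub communicator so it runs without MPI; the function and variable names other than Barrier/Abort are illustrative, not the actual hf_sim.py code:

```python
import os
import tempfile

class StubComm:
    """Stand-in for mpi4py's COMM_WORLD so this sketch runs without MPI."""
    def Barrier(self):
        pass                          # real code: blocks until all ranks arrive
    def Abort(self, errorcode):
        raise SystemExit(errorcode)   # real code: kills every rank immediately

def validate_after_barrier(comm, is_master, out_file, expected_size):
    comm.Barrier()    # past this line, no rank can still be writing
    if is_master:     # single authoritative check, strictly after all I/O
        actual_size = os.stat(out_file).st_size
        if actual_size != expected_size:
            print(f"CRITICAL: Final file size mismatch! "
                  f"Expected {expected_size}, got {actual_size}", flush=True)
            comm.Abort(1)   # cluster-wide kill switch
        else:
            print("HF completed successfully", flush=True)

# Usage: a correctly sized file passes; a truncated one triggers Abort.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 8)
validate_after_barrier(StubComm(), True, f.name, 8)  # prints the success line
```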

Copilot AI review requested due to automatic review settings March 1, 2026 21:04
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical race condition in distributed file processing where file size validation could fail due to premature checks by worker ranks. By centralizing the validation logic on the master rank and enforcing a global synchronization barrier, the changes ensure deterministic and accurate file integrity checks, preventing erroneous exits and improving the reliability of the simulation workflow.

Highlights

  • Premature File Validation Fix: Resolved an issue where file size validation could occur prematurely in multi-node environments, leading to incorrect pass/fail results, by ensuring validation only happens after all processes have completed their I/O.
  • Centralized Validation Logic: Moved the file size validation logic from individual worker ranks to the master rank, ensuring a single, authoritative check after all distributed work is finalized.
  • Enhanced Synchronization: Leveraged comm.Barrier() to guarantee all ranks have finished their calculations and I/O before the master rank performs the final file size validation.
  • Improved Logging: Added flush=True to standard output print statements to ensure timely delivery of log messages, especially in environments like Slurm.
Changelog
  • workflow/calculation/hf_sim.py
    • Removed the validate_end function, which was previously responsible for verifying file size based on the last completed station.
    • Eliminated the conditional call to validate_end by worker ranks, which was the source of the premature validation issue.
    • Implemented new file size validation logic within the master rank's post-barrier section, comparing the actual file size against an expected size.
    • Added flush=True to print statements for immediate output delivery.
    • Removed a redundant process completion print statement.
Activity
  • No human activity has occurred on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a potential race condition in file validation within a distributed MPI environment. By moving the validation logic to the master rank and executing it only after all worker ranks have synchronized at a barrier, the change ensures that the file size is checked deterministically after all I/O operations are complete. The removal of the previous validate_end function, a redundant print statement, and the addition of flush=True to print calls are all good improvements. I've added one suggestion to further improve the robustness of the new validation logic by handling potential OSError exceptions, which will prevent the master rank from crashing and ensure a clean shutdown of all processes in case of unexpected file system issues.

Comment on lines 549 to +558
if is_master:
    logger.debug("Simulation completed.")
    actual_size = os.stat(args.out_file).st_size
    if actual_size != file_size:
        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)
    else:
        logger.debug("Simulation completed and size verified.")
        print("✅ HF completed successfully", flush=True)


Severity: medium

For increased robustness, it's good practice to wrap the file size check in a try...except OSError block. This will gracefully handle cases where os.stat(args.out_file) might fail (e.g., if the file is unexpectedly deleted or permissions change during execution). This prevents the master rank from crashing with an unhandled exception and ensures a clean shutdown of all MPI processes via comm.Abort().

Suggested change

Current:

if is_master:
    logger.debug("Simulation completed.")
    actual_size = os.stat(args.out_file).st_size
    if actual_size != file_size:
        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)
    else:
        logger.debug("Simulation completed and size verified.")
        print("✅ HF completed successfully", flush=True)

Suggested:

if is_master:
    try:
        actual_size = os.stat(args.out_file).st_size
        if actual_size != file_size:
            msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
            print(msg, flush=True)
            logger.error(msg)
            comm.Abort(1)
        else:
            logger.debug("Simulation completed and size verified.")
            print("✅ HF completed successfully", flush=True)
    except OSError as e:
        msg = f"CRITICAL: Could not stat output file {args.out_file}: {e}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)


Copilot AI left a comment


Pull request overview

This PR fixes a race condition in MPI HF simulations where file-size validation could run before all ranks finished writing, by moving final validation to the master rank after an MPI barrier.

Changes:

  • Removed the per-rank validate_end() logic that could trigger early while other ranks were still writing.
  • Performed final output file size verification on the master rank strictly after comm.Barrier().
  • Adjusted stdout printing to flush immediately for more reliable Slurm log output.


Comment on lines 517 to 518
# distribute work, must be sequential for optimisation,
# and for validation function above to be thread safe

Copilot AI Mar 1, 2026


The comment about needing sequential distribution "for validation function above to be thread safe" is now stale since validate_end() was removed and validation happens on the master after the barrier. Please update/remove this comment to reflect the new synchronization/validation approach (and avoid misleading future maintainers).

Suggested change
# distribute work, must be sequential for optimisation,
# and for validation function above to be thread safe
# distribute work in a round-robin fashion across ranks for optimisation

