Let master handle validate_end() #539

Open
sungeunbae wants to merge 2 commits into master from hf_sim_fix

Conversation

sungeunbae (Member) commented Mar 1, 2026

The ISSUE
The old code assumed that the rank processing the numerically highest station index (work_idx[-1] == stations_todo_idx[-1]) would inherently be the last rank to finish writing to the file.

In a multi-node environment, that is not guaranteed. Rank 35 might be assigned the very last station and finish it in 10 seconds. Meanwhile, Rank 2 might be assigned a complex station early in the list and take 15 seconds to finish.
Under the old logic, Rank 35 would finish, trigger validate_end(), and potentially pass or fail the size check while Rank 2 is still actively writing data to the middle of the file.

Furthermore, if a rank hit a validation error and exited silently without signaling the whole communicator, the surviving ranks would wait at the MPI barrier indefinitely, burning core hours in a "zombie" deadlock until the Slurm allocation hit its walltime.
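The mismatch can be shown with a tiny stand-alone sketch (hypothetical finish times taken from the example above; no MPI needed):

```python
# Hypothetical sketch: per-rank wall times (seconds) from the example above.
# Rank 35 holds the numerically last station index, but rank 2 takes longer.
finish_time = {2: 15.0, 35: 10.0}   # rank -> time its writes complete
rank_with_last_station = 35         # old code: this rank ran validate_end()

validation_runs_at = finish_time[rank_with_last_station]
last_write_ends_at = max(finish_time.values())
print(validation_runs_at < last_write_ends_at)  # True: file checked too early
```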

Solution

  1. Removed the isolated validate_end() function from the worker loop.
  2. Relied on the existing comm.Barrier() to ensure all ranks have completely finished their calculations and I/O.
  3. Delegated the final file size validation to the Master Rank (is_master) strictly after the barrier is lifted.
  4. Added flush=True to standard output prints to ensure timely log delivery in Slurm .out files.
  5. Implemented comm.Abort(1) for critical file size mismatches. IMPORTANT: when paired with the srun --quit-on-interrupt --kill-on-bad-exit=1 flags in the submission wrappers, this guarantees that a validation failure acts as a cluster-wide kill switch, instantly terminating all tasks across all nodes and preventing hung jobs.
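As a sketch, a submission wrapper pairing those flags with comm.Abort(1) might look like the following (the sbatch resources, script path, and arguments are illustrative assumptions; only the two srun flags come from this change):

```shell
#!/bin/bash
#SBATCH --nodes=4              # illustrative resource request, not from the PR
#SBATCH --ntasks-per-node=16

# --kill-on-bad-exit=1: if any task exits non-zero (e.g. after comm.Abort(1)),
#   srun terminates every task on every node instead of waiting for walltime.
# --quit-on-interrupt: a single SIGINT quits immediately rather than prompting.
srun --quit-on-interrupt --kill-on-bad-exit=1 \
    python workflow/calculation/hf_sim.py "$@"
```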

This ensures deterministic file validation, prevents premature exits, and guarantees clean job failures without wasting compute allocation.
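Assuming an mpi4py-style interface (as the comm.Barrier()/comm.Abort(1) calls suggest), the post-barrier flow reduces to the pattern below. The sketch uses a stub communicator so it runs without MPI; the function and variable names other than Barrier/Abort are illustrative, not the actual hf_sim.py code:

```python
import os
import tempfile

class StubComm:
    """Stand-in for mpi4py's COMM_WORLD so this sketch runs without MPI."""
    def Barrier(self):
        pass                          # real code: blocks until all ranks arrive
    def Abort(self, errorcode):
        raise SystemExit(errorcode)   # real code: kills every rank immediately

def validate_after_barrier(comm, is_master, out_file, expected_size):
    comm.Barrier()    # past this line, no rank can still be writing
    if is_master:     # single authoritative check, strictly after all I/O
        actual_size = os.stat(out_file).st_size
        if actual_size != expected_size:
            print(f"CRITICAL: Final file size mismatch! "
                  f"Expected {expected_size}, got {actual_size}", flush=True)
            comm.Abort(1)   # cluster-wide kill switch
        else:
            print("HF completed successfully", flush=True)

# Usage: a correctly sized file passes; a truncated one triggers Abort.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 8)
validate_after_barrier(StubComm(), True, f.name, 8)  # prints the success line
```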

Copilot AI review requested due to automatic review settings March 1, 2026 21:04
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical race condition in distributed file processing where file size validation could fail due to premature checks by worker ranks. By centralizing the validation logic on the master rank and enforcing a global synchronization barrier, the changes ensure deterministic and accurate file integrity checks, preventing erroneous exits and improving the reliability of the simulation workflow.

Highlights

  • Premature File Validation Fix: Resolved an issue where file size validation could occur prematurely in multi-node environments, leading to incorrect pass/fail results, by ensuring validation only happens after all processes have completed their I/O.
  • Centralized Validation Logic: Moved the file size validation logic from individual worker ranks to the master rank, ensuring a single, authoritative check after all distributed work is finalized.
  • Enhanced Synchronization: Leveraged comm.Barrier() to guarantee all ranks have finished their calculations and I/O before the master rank performs the final file size validation.
  • Improved Logging: Added flush=True to standard output print statements to ensure timely delivery of log messages, especially in environments like Slurm.
Changelog
  • workflow/calculation/hf_sim.py
    • Removed the validate_end function, which was previously responsible for verifying file size based on the last completed station.
    • Eliminated the conditional call to validate_end by worker ranks, which was the source of the premature validation issue.
    • Implemented new file size validation logic within the master rank's post-barrier section, comparing the actual file size against an expected size.
    • Added flush=True to print statements for immediate output delivery.
    • Removed a redundant process completion print statement.
Activity
  • No human activity has occurred on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a potential race condition in file validation within a distributed MPI environment. By moving the validation logic to the master rank and executing it only after all worker ranks have synchronized at a barrier, the change ensures that the file size is checked deterministically after all I/O operations are complete. The removal of the previous validate_end function, a redundant print statement, and the addition of flush=True to print calls are all good improvements. I've added one suggestion to further improve the robustness of the new validation logic by handling potential OSError exceptions, which will prevent the master rank from crashing and ensure a clean shutdown of all processes in case of unexpected file system issues.

Comment on lines 549 to +558
if is_master:
    logger.debug("Simulation completed.")
    actual_size = os.stat(args.out_file).st_size
    if actual_size != file_size:
        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)
    else:
        logger.debug("Simulation completed and size verified.")
        print("✅ HF completed successfully", flush=True)


Severity: medium

For increased robustness, it's good practice to wrap the file size check in a try...except OSError block. This will gracefully handle cases where os.stat(args.out_file) might fail (e.g., if the file is unexpectedly deleted or permissions change during execution). This prevents the master rank from crashing with an unhandled exception and ensures a clean shutdown of all MPI processes via comm.Abort().

Suggested change

Current:

if is_master:
    logger.debug("Simulation completed.")
    actual_size = os.stat(args.out_file).st_size
    if actual_size != file_size:
        msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)
    else:
        logger.debug("Simulation completed and size verified.")
        print("✅ HF completed successfully", flush=True)

Suggested:

if is_master:
    try:
        actual_size = os.stat(args.out_file).st_size
        if actual_size != file_size:
            msg = f"CRITICAL: Final file size mismatch! Expected {file_size}, got {actual_size}"
            print(msg, flush=True)
            logger.error(msg)
            comm.Abort(1)
        else:
            logger.debug("Simulation completed and size verified.")
            print("✅ HF completed successfully", flush=True)
    except OSError as e:
        msg = f"CRITICAL: Could not stat output file {args.out_file}: {e}"
        print(msg, flush=True)
        logger.error(msg)
        comm.Abort(1)


Copilot AI left a comment


Pull request overview

This PR fixes a race condition in MPI HF simulations where file-size validation could run before all ranks finished writing, by moving final validation to the master rank after an MPI barrier.

Changes:

  • Removed the per-rank validate_end() logic that could trigger early while other ranks were still writing.
  • Performed final output file size verification on the master rank strictly after comm.Barrier().
  • Adjusted stdout printing to flush immediately for more reliable Slurm log output.


Comment on lines 517 to 518
# distribute work, must be sequential for optimisation,
# and for validation function above to be thread safe

Copilot AI Mar 1, 2026


The comment about needing sequential distribution "for validation function above to be thread safe" is now stale since validate_end() was removed and validation happens on the master after the barrier. Please update/remove this comment to reflect the new synchronization/validation approach (and avoid misleading future maintainers).

Suggested change
# distribute work, must be sequential for optimisation,
# and for validation function above to be thread safe
# distribute work in a round-robin fashion across ranks for optimisation

