
Conversation

mbercx (Member) commented Dec 8, 2025

@cpignedoli this is a rough draft PR for you to test! I have to sign off for the day, but already wanted to get your feedback.

To be clear: this is not ready for proper review. ^^ It's based on (and blocked by) #7116; pre-commit was ignored, and there are no tests or documentation changes yet. It's just to see if it already works as expected, and to get feedback on one question:

  1. If the process is paused because the max iterations of a handler was hit, which counters do we reset? Right now, only the corresponding handler counter is reset (or the global one, in case that was the one that was hit). Should all counters be reset? Or should the global counter also be reset when the max iterations of a single error handler is hit?

Some code that may help in testing:

```python
from aiida import orm, engine, load_profile
from aiida_quantumespresso.workflows.pw.base import PwBaseWorkChain
from ase.build import bulk

load_profile('dev')

code = orm.load_code('pw@localhost')
structure = orm.StructureData(ase=bulk('Si'))

builder = PwBaseWorkChain.get_builder_from_protocol(
    code, structure,
    protocol='precise'
)
# builder.max_iterations = 1
builder.handler_overrides = {
    'handle_out_of_walltime': {
        'max_iterations': 1
    }
}
builder.pause_on_max_iterations = True
builder.pw.parameters['CONTROL']['max_seconds'] = 3

engine.submit(builder)
```

mbercx linked an issue Dec 8, 2025 that may be closed by this pull request
codecov bot commented Dec 8, 2025

Codecov Report

❌ Patch coverage is 0% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 29.17%. Comparing base (f21bcd4) to head (09aa0b7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/aiida/engine/processes/workchains/restart.py | 0.00% | 35 Missing ⚠️ |
| src/aiida/engine/processes/workchains/utils.py | 0.00% | 4 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (f21bcd4) and HEAD (09aa0b7). Click for more details.

HEAD has 1 upload less than BASE:

| Flag | BASE (f21bcd4) | HEAD (09aa0b7) |
|---|---|---|
| | 2 | 1 |
Additional details and impacted files
```
@@             Coverage Diff             @@
##             main    #7139       +/-   ##
===========================================
- Coverage   79.60%   29.17%   -50.43%
===========================================
  Files         566      566
  Lines       43584    43566       -18
===========================================
- Hits        34691    12705    -21986
- Misses       8893    30861    +21968
```

cpignedoli (Contributor) commented
Hi @mbercx, I tested it directly in the QE App (aiida 2.7.1, aiida_quantumespresso 4.13):

```python
relax_builder.base.handler_overrides = {
    'handle_out_of_walltime': {
        'max_iterations': 1
    }
}
relax_builder.base.pause_on_max_iterations = True
```

For me it works; there is just one minor bug (see the screenshot below):
in the report it writes "5" as the maximum number of iterations, probably because the value is taken from self.inputs, while you redefine it in a different variable:

```python
max_handler_iterations = max_iterations_override if max_iterations_override else handler.max_iterations
```
[screenshot: work chain report showing "5" as the maximum number of iterations]
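The fix described above, building the report message from the effective (possibly overridden) maximum rather than from `self.inputs.max_iterations`, could be sketched as follows. All names here are illustrative, not the actual aiida-core implementation:

```python
def handler_report_message(handler_name, max_iterations_override, default_max_iterations):
    """Build the report message from the effective maximum for this handler.

    Illustrative sketch: the override (e.g. from `handler_overrides`) takes
    precedence over the handler's default maximum.
    """
    effective = (
        max_iterations_override
        if max_iterations_override is not None
        else default_max_iterations
    )
    return f'Reached the maximum number of iterations ({effective}) for handler `{handler_name}`.'


# With an override of 1 and a default of 5, the report shows "(1)", not "(5)":
print(handler_report_message('handle_out_of_walltime', 1, 5))
```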

mbercx force-pushed the new/max-iter-pause branch 2 times, most recently from 607b133 to 1b391bf on December 12, 2025 at 00:58
mbercx force-pushed the new/max-iter-pause branch from 1b391bf to bda4564 on December 12, 2025 at 04:17
mbercx (Member, Author) commented Dec 12, 2025

Thanks for testing @cpignedoli! Good catch on the report. I fixed this, but realised a critical issue in my original implementation:

Returning an exit code from the `should_run_process` method, which is used in a `while_` loop in the outline, is interpreted as `True`. This breaks the logic: another process will be run even though an exit code should be returned, and the workflow keeps restarting even though the maximum number of iterations has been reached. I'm not sure if this should be the case, but that is a discussion for plumpy that is beyond the scope of this PR.
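The mechanism behind this can be shown outside of plumpy: `ExitCode` in aiida-core is a `NamedTuple`, and any non-empty tuple is truthy in Python, so a `while_` condition that returns one reads as "keep looping". A minimal standalone sketch, using a stand-in namedtuple rather than the actual aiida class:

```python
from collections import namedtuple

# Stand-in for aiida's ExitCode (which is a NamedTuple): any non-empty tuple
# is truthy, so even an "error" exit code evaluates as True in a condition.
ExitCode = namedtuple('ExitCode', ['status', 'message'])

def should_run_process():
    # Intended as "stop with an error", but interpreted as "keep looping":
    return ExitCode(401, 'maximum iterations reached')

result = should_run_process()
print(bool(result))  # True, even though the status is nonzero
```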

To fix this, I've moved all the logic regarding `max_iterations` to where I think it belongs: the `inspect_process` step. Some other improvements:

  1. The report no longer claims to be restarting, only to then abort because the maximum number of iterations has been reached. On the current `main` branch, you get (prefix trimmed):

    ...]: PwCalculation<1560> failed but a handler dealt with the problem, restarting
    ...]: reached the maximum number of iterations 1: last ran PwCalculation<1560>
    

    Now you get:

    ...]: Reached the maximum number of global iterations (1).
    ...]: Aborting! Last ran: PwCalculation<1577>
    
  2. When pausing, the logic for resetting the counters is improved. Basically, the code checks each of the maximum iterations: the global one, and one for each of the handlers. If a `max_iterations` is hit and `pause_on_max_iterations` is `False`, the work chain aborts. Otherwise, the corresponding counter is reset and the `pause_process` variable is set to `True`. After all the maximum iterations have been checked, the process is paused. This gives a clear report:

    ...]: Reached the maximum number of global iterations (1).
    ...]: Reached the maximum number of iterations (1) for handler `handle_out_of_walltime`.
    ...]: PwCalculation<1614> failed but a handler dealt with the problem, restarting
    ...]: Resetting the iteration counter(s) and pausing for inspection. You can resume execution using `verdi process play 1609`, or kill the work chain using `verdi process kill 1609`.
    

    And because both the handler and global counters are reset in this case, you don't run into the issue where the process immediately pauses again on the next iteration because one of the two counters is still at its maximum. Other counters are unaffected, however. I think this is the correct behavior.
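The counter-checking flow described in point 2 can be sketched as follows. This is a hypothetical illustration with made-up names, not the actual aiida-core code:

```python
def check_max_iterations(counters, limits, pause_on_max_iterations):
    """Check every iteration counter against its limit.

    Illustrative sketch: if a limit is hit and pausing is disabled, abort;
    otherwise reset each offending counter and request a single pause after
    all checks. Counters that have not hit their limit are left untouched.
    """
    pause_process = False
    for name, count in counters.items():
        if count >= limits.get(name, float('inf')):
            if not pause_on_max_iterations:
                return 'abort'
            counters[name] = 0  # reset only the counter that hit its limit
            pause_process = True
    return 'pause' if pause_process else 'continue'


counters = {'global': 1, 'handle_out_of_walltime': 1, 'other_handler': 0}
limits = {'global': 1, 'handle_out_of_walltime': 1, 'other_handler': 5}
print(check_max_iterations(counters, limits, pause_on_max_iterations=True))
print(counters)  # both offending counters reset; 'other_handler' untouched
```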

Tests and documentation have been added, so I think this one is now ready for proper review!


EDIT: spoke too soon; some more notes:

  1. Because of the changes, there is a small edge case where the global `max_iterations` is not respected: the check is only triggered when a handler has returned a `ProcessHandlerReport`. So if there is an unhandled failure and the user has set `on_unhandled_failure` to restart or pause, it is possible to go beyond `max_iterations`. I think this is fine: the `max_iterations` input is mostly there to control how many times a handler should be triggered. If the user sets `on_unhandled_failure` to pause, but the work chain then aborts because the maximum number of iterations is hit, that would be annoying.
  2. I have to look more closely at the logic for error handlers such as this one (from the PwBaseWorkChain):
```python
def handle_vcrelax_converged_except_final_scf(self, calculation):
    """Handle `ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF` exit code.

    Convergence reached in `vc-relax` except thresholds exceeded in final scf: consider as converged.
    """
    self.ctx.is_finished = True
    action = 'ionic convergence thresholds met except in final scf: consider structure relaxed.'
    self.report_error_handled(calculation, action)
    self.results()  # Call the results method to attach the output nodes
    return ProcessHandlerReport(True, self.exit_codes.ERROR_IONIC_CONVERGENCE_REACHED_EXCEPT_IN_FINAL_SCF)
```

Notice the "hack" here: `self.ctx.is_finished = True` is combined with `self.results()` to handle the use case of "this exit code is fine, just attach the outputs". I'm not sure I like this approach, but technically it's possible. I would probably have added an `is_finished` keyword to the `ProcessHandlerReport`, but for now we should still support this usage.
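For reference, the alternative hinted at above could look something like this. Note that the `is_finished` field is purely hypothetical and does NOT exist in aiida-core; this is only a sketch of the idea:

```python
from typing import NamedTuple, Optional


class ProcessHandlerReport(NamedTuple):
    """Sketch of a handler report with a hypothetical `is_finished` flag.

    The real report in aiida-core carries `do_break` and `exit_code`;
    `is_finished` is an invented field meaning "attach outputs and stop",
    replacing the `self.ctx.is_finished = True` / `self.results()` hack.
    """
    do_break: bool = False
    exit_code: Optional[object] = None
    is_finished: bool = False  # hypothetical, not in aiida-core


report = ProcessHandlerReport(do_break=True, is_finished=True)
print(report.is_finished)  # True
```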

EDIT2: This second issue is now fixed in 09aa0b7, and I've added a test for it.

```python
        )
        self.pause(f"Paused for user inspection, see: 'verdi process report {self.node.pk}'")

        return last_report.exit_code
```
mbercx (Member, Author) commented on the diff:
Here, the BaseRestartWorkChain relied on the fact that returning ExitCode(0) from a step of the outline doesn't really do anything, as far as I can tell: it simply starts executing the next step. There might not even be any difference between returning that and None.

I think it's better to be explicit and return the exit code in case its exit status is nonzero, and return None otherwise.
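The convention described here can be sketched as follows (illustrative names and a stand-in namedtuple, not the actual BaseRestartWorkChain code):

```python
from collections import namedtuple

# Stand-in for aiida's ExitCode NamedTuple.
ExitCode = namedtuple('ExitCode', ['status', 'message'])

def outline_step_result(exit_code):
    """Be explicit: return the exit code only when its status is nonzero,
    and return None otherwise (instead of returning ExitCode(0))."""
    if exit_code is not None and exit_code.status != 0:
        return exit_code
    return None


print(outline_step_result(ExitCode(0, 'success')))  # None
print(outline_step_result(ExitCode(410, 'maximum iterations reached')))
```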

This is the case as long as the last process has not finished successfully or a handler has been triggered.
"""
max_iterations = self.inputs.max_iterations.value
return not self.ctx.is_finished and self.ctx.iteration < max_iterations
mbercx (Member, Author) commented on the diff:
Whether or not the global `max_iterations` has been reached is now checked in the `inspect_process` step.



Development

Successfully merging this pull request may close these issues.

BaseRestartWorkChain: pause at max_iterations
