multiprocessing.Barrier does not return if waiting process is terminated (tested on Windows) #123899

Open · jpvolkmann opened this issue Sep 10, 2024 · 10 comments
Labels: topic-multiprocessing, type-bug

jpvolkmann commented Sep 10, 2024

Bug report

Bug description:

I'm using multiprocessing.Barrier to synchronize processes. The individual processes might get terminated/killed by a GUI user interaction.
If a process that has already entered the waiting state is killed, the barrier is never released; the timeout is not taken into account.
While debugging this issue I realized that it is caused by

self._woken_count.acquire() # wait for a sleeper to wake

self._woken_count is not released by the terminated process, and no timeout is specified in the call to self._woken_count.acquire().

If I add a (hardcoded...) timeout, i.e. self._woken_count.acquire(True, 2.0), the program continues as expected.
I hope there are better approaches to fixing this than a hardcoded timeout...
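
A minimal sketch of the failure mode in isolation (illustrative names only, not the actual CPython internals): a blocking acquire() with no timeout hangs forever once the process that was supposed to release() it has been killed.

import multiprocessing as mp
import time

def sleeper(woken_count):
    # A well-behaved waiter releases the semaphore when it wakes up;
    # a killed process never gets the chance.
    time.sleep(60)
    woken_count.release()

if __name__ == '__main__':
    woken_count = mp.Semaphore(0)
    p = mp.Process(target=sleeper, args=(woken_count,), daemon=True)
    p.start()
    time.sleep(0.2)
    p.kill()  # the release() above will never happen

    # woken_count.acquire() would now block forever, which mirrors the
    # blocking call inside Condition.notify(). With a timeout, control
    # returns to the caller:
    got = woken_count.acquire(True, 2.0)
    print('acquired:', got)  # False: the sleeper died without releasing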

Example script to reproduce the error:

import logging
import time
import sys
import multiprocessing as mp
import multiprocessing.synchronize

from pathlib import Path

def process_main(process_index, barrier: multiprocessing.synchronize.Barrier):
    logging.debug('Process %d: Started', process_index)
    time.sleep(0.5)
    logging.debug('Process %d: Waiting for barrier', process_index)
    barrier.wait(timeout=5.0)
    logging.debug('Process %d: Barrier passed', process_index)
    time.sleep(0.5)
    logging.debug('Process %d: Terminated', process_index)

if __name__ == '__main__':
    # Set up logging
    logfile_name = Path(__file__).with_suffix('.log').name
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(levelname)s:%(name)s %(message)s',
        handlers=[
            logging.FileHandler(logfile_name, mode='w'),
            logging.StreamHandler(sys.stdout)
        ]
    )

    instance_count = 4
    barrier = mp.Barrier(instance_count)

    processes = []
    for i in range(instance_count):
        runner_process = mp.Process(
            target=process_main, args=(i, barrier), daemon=True)
        processes.append(runner_process)

    for i, process in enumerate(processes):
        logging.debug('Starting process %d', i)
        process.start()
        time.sleep(0.200)

    # Terminate already waiting process
    logging.debug('Killing process 0')
    processes[0].kill()

    for process in processes:
        process.join()

CPython versions tested on:

3.12

Operating systems tested on:

Windows

jpvolkmann added the type-bug label Sep 10, 2024

Zheaoli (Contributor) commented Sep 10, 2024

Bug confirmed (on Linux).

picnixz (Contributor) commented Sep 10, 2024

Just wondering, but can you reproduce the issue without the logging part?

jpvolkmann (Author) commented Sep 10, 2024

> Just wondering, but can you reproduce the issue without the logging part?

Just tested without logging; same behavior as before (as expected).
self._woken_count.acquire() is called in blocking mode without a timeout.

Zheaoli (Contributor) commented Sep 10, 2024

@picnixz would you mind assigning this issue to me? (lol)

Zheaoli (Contributor) commented Sep 13, 2024

Upon thorough investigation of this issue, I believe it should be classified as a missing feature rather than a bug.

Specifically, we need to implement timeout handling for the following:

  • The notify_all and notify methods of the Condition primitive
  • The wait method of the Barrier primitive

I'm currently considering whether to open a feature request on discuss.python.org for this enhancement.

cc @picnixz

YvesDup (Contributor) commented Sep 13, 2024

The Barrier class already has a timeout argument, and so does its wait method.
Don't they? Or am I missing something?

thomas-lets-go commented

I'm also a little confused; it seems that the problem is about killing a process gracefully without causing a deadlock, not about requesting new features from multiprocessing.Barrier.

Process.kill() is a brutal way to kill a process, because exit handlers, finally clauses, etc. will not be executed, so the Barrier is not properly cleaned up and its internal state is left inconsistent; as a result, the other processes are blocked forever.

multiprocessing.Barrier is just a clone of threading.Barrier, which has all the necessary routines, including Barrier(parties[, action[, timeout]]) and wait(timeout=None).

As far as I can see, before Process.kill() is called in the main process, Barrier.abort() should be called first, and the other processes are supposed to deal with the resulting exception. Anyway, let me know if I misunderstand the point.
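
A minimal sketch of that pattern, based on the repro script above: abort the barrier before killing, and let the workers handle threading.BrokenBarrierError.

import multiprocessing as mp
import threading
import time

def worker(i, barrier):
    time.sleep(0.5)
    try:
        barrier.wait(timeout=5.0)
        print(f'process {i}: barrier passed')
    except threading.BrokenBarrierError:
        # abort() breaks the barrier; every current and future waiter
        # gets this exception instead of blocking forever.
        print(f'process {i}: barrier broken, cleaning up')

if __name__ == '__main__':
    barrier = mp.Barrier(4)
    processes = [mp.Process(target=worker, args=(i, barrier), daemon=True)
                 for i in range(4)]
    for p in processes:
        p.start()
    time.sleep(0.2)
    barrier.abort()       # break the barrier *before* killing anyone
    processes[0].kill()
    for p in processes:
        p.join()          # returns promptly; no deadlock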

Zheaoli (Contributor) commented Sep 14, 2024

> The Barrier class already has a timeout argument, and so does its wait method.
> Don't they? Or am I missing something?

You are right, wait already has a timeout. But it does not take effect on every code path. FYI: https://github.com/python/cpython/blob/main/Lib/threading.py#L724-L726

I think we should add a timeout to the _release path.
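
For illustration, a rough sketch of that idea (hypothetical code; these signatures do not exist in CPython today): the timeout from Barrier.wait() would be plumbed through _release() into the Condition's notify handshake, so the internal woken-count acquire cannot block forever when a sleeper has been killed.

# Hypothetical sketch only, not actual CPython code.
class Barrier:
    def _release(self, timeout=None):
        self._state = 1
        self._cond.notify_all(timeout=timeout)  # would need a new parameter

# ...and inside multiprocessing's Condition.notify(), the handshake that
# currently blocks unconditionally would become something like:
#     for _ in range(sleepers):
#         if not self._woken_count.acquire(True, timeout):
#             break  # a sleeper died before acknowledging; stop waiting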

jpvolkmann (Author) commented

Killing the process was just a way to demonstrate the effect; it's not the real use case...
I fully understand that killing a process is not the typical use case and that adding a timeout might have some unwanted side effects.
Meanwhile I also figured out that the barrier is in a messed-up state even after I add the timeout to acquire(), and it cannot be reset (because of pending locks (?), or because _ncount does not reach 0). I'll find another solution for my use case...

In general, I use multiprocessing to isolate processes with unreliable third-party binary libraries that can cause access violations in case of e.g. HW defects. An access violation causes the process to get killed.
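
Until something like that lands, one user-level mitigation (a sketch reusing the names from the repro script above; watch_and_abort is a hypothetical helper, not a stdlib API): monitor the process sentinels from the main process and abort the barrier as soon as a participant dies abnormally, so the survivors get BrokenBarrierError instead of hanging. Caveat: if the release handshake has already deadlocked, abort() itself can block on the condition's lock, so the watchdog has to react before the last party arrives.

import threading
from multiprocessing.connection import wait as wait_for_sentinels

def watch_and_abort(processes, barrier):
    # One-shot watchdog: block until at least one process exits,
    # then break the barrier if any exit was abnormal.
    wait_for_sentinels([p.sentinel for p in processes])
    if any(p.exitcode not in (0, None) for p in processes):
        barrier.abort()  # surviving waiters get BrokenBarrierError

# In the main process, right after starting the workers:
# threading.Thread(target=watch_and_abort, args=(processes, barrier),
#                  daemon=True).start()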

Zheaoli (Contributor) commented Sep 14, 2024

> As far as I can see, before Process.kill() is called in the main process, Barrier.abort() should be called first, and the other processes are supposed to deal with the resulting exception. Anyway, let me know if I misunderstand the point.

Yes, that is the normal circumstance. But in some cases, such as the process being killed from a GUI or by systemd, we can't run cleanup code before the process exits. So for now, the timeout argument does not take effect on some code paths (FYI: https://github.com/python/cpython/blob/main/Lib/threading.py#L724-L726). I think we need to pass the timeout argument down to the _release code path.
