bpo-31699 Deadlocks in concurrent.futures.ProcessPoolExecutor with pickling error #3895


Merged: 15 commits from tomMoral:PR_robust_failures into python:master on Jan 5, 2018

Conversation


@tomMoral tomMoral commented Oct 5, 2017

When using concurrent.futures.ProcessPoolExecutor with objects that cannot be pickled or unpickled, several situations result in a deadlock, with the interpreter frozen.

This is the case in several scenarios, for instance these three scripts. This PR proposes to test the different failure scenarios and fix the ones that result in deadlocks.

Overall, the goal is to make concurrent.futures.ProcessPoolExecutor more robust to faulty user code.
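For concreteness, here is a minimal sketch of one such scenario (the class is illustrative, it is not taken from the scripts linked above): an object that raises while being pickled for submission.

from concurrent.futures import ProcessPoolExecutor

class ErrorAtPickle:
    """Illustrative faulty user object: raises when pickled."""
    def __reduce__(self):
        raise RuntimeError("error at pickle time")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        # Submitting an argument that cannot be pickled: without the fixes in
        # this PR, exiting the with-block (executor.shutdown) can hang forever
        # instead of surfacing an error.
        future = executor.submit(id, ErrorAtPickle())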

This work was done as part of tommoral/loky#48, with the intent of re-using the executor in multiple independent parts of the program, in collaboration with @ogrisel. See #1013 for more details.

https://bugs.python.org/issue31699

@tomMoral tomMoral requested a review from rhettinger as a code owner October 5, 2017 07:41
@tomMoral tomMoral changed the title Deadlocks in concurrent.futures.ProcessPoolExecutor with pickling error bpo-31699 Deadlocks in concurrent.futures.ProcessPoolExecutor with pickling error Oct 5, 2017
@tomMoral tomMoral force-pushed the PR_robust_failures branch from 5865996 to 69d774f Compare October 5, 2017 07:48

pitrou commented Oct 5, 2017

At a high level, I'm a bit surprised by the strategy adopted by this PR. Isn't it possible to keep a functioning executor in case an error occurs when (un)pickling? What does multiprocessing.Pool do (I don't remember)?

The crashing process problem is separate IMHO (and arguably less likely), though it does deserve to be handled.


tomMoral commented Oct 5, 2017

For unpickling errors, there is currently no way to recover the work_id of the failed item once the error has occurred, which is why we resorted to flagging the executor as broken. Another strategy that could solve this would be to change the _ResultItem/_CallItem objects to have a safe unpickle that is guaranteed to give back at least the work_id, so that in case of failure we can simply attach the exception to that job. But I am not sure how to do it.
Note that multiprocessing.Pool is also broken when a result fails to unpickle (see here).
If you think this is necessary for this PR, I can look into it.
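For illustration only, here is a rough sketch of that "safe unpickle" idea (hypothetical names; this is not what the PR implements): the item pickles its payload separately, so a failure on either side still ships the work_id together with an exception.

import pickle

def _rebuild_safe_result_item(work_id, payload):
    # Runs at unpickling time: a failure here is captured and attached to the
    # job identified by work_id instead of breaking the whole executor.
    try:
        result, exception = pickle.loads(payload)
    except BaseException as exc:
        result, exception = None, exc
    return _SafeResultItem(work_id, result, exception)

class _SafeResultItem:
    def __init__(self, work_id, result=None, exception=None):
        self.work_id = work_id
        self.result = result
        self.exception = exception

    def __reduce__(self):
        # Pickle the payload eagerly so that pickling failures are also caught
        # and shipped along with the work_id.
        try:
            payload = pickle.dumps((self.result, self.exception))
        except BaseException as exc:
            payload = pickle.dumps((None, exc))
        return (_rebuild_safe_result_item, (self.work_id, payload))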

For pickling errors, this is much simpler, and I should add handling for result pickling errors to make sure we do not break the executor in that case. The pickling error for the call item is already handled that way.


tomMoral commented Oct 5, 2017

After thinking a bit more, my main concern about safe pickling/unpickling of _ResultItem/_CallItem is that we are agnostic of the Pickler objects used in the call_queue and result_queue. This should be the _ForkingPickler, but it could be changed depending on the context, as we do in the loky project. So I am not sure it is the right solution.

@tomMoral tomMoral force-pushed the PR_robust_failures branch 2 times, most recently from 8427d36 to c70c4d6 Compare October 23, 2017 16:36

ogrisel commented Oct 31, 2017

@tomMoral can you please summarize the changes you made in the last commit? The tests deadlock now. I think this is no longer robust to the ExitAtUnpickle case (which I think is fine, but it should then be removed from the test suite).


tomMoral commented Nov 2, 2017

@ogrisel In the last commit, I just removed the crash detection thread and the test-suite cases that are no longer handled. It works locally, but it seems to fail deterministically on Travis with clang. I cannot reproduce it, even with a Docker image from Travis.

@tomMoral tomMoral force-pushed the PR_robust_failures branch 2 times, most recently from e8595c5 to 7ad6721 Compare November 3, 2017 10:22

ogrisel commented Nov 3, 2017

I managed to reproduce the deadlock (without -j4 and with --timeout=300):

test_crash (test.test_concurrent_futures.ProcessPoolForkserverExecutorDeadlockTest) ... error at task pickle
exit at task unpickle
error at task unpickle
crash at task unpickle
crash during func execution on worker
exit during func execution on worker
error during func execution on worker
crash during result pickle on worker
exit during result pickle on worker
error during result pickle on worker
error during result unpickle in result_handler

Timeout (0:05:00)!
Thread 0x00007fa7a0a88700 (most recent call first):
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 296 in wait
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/queues.py", line 224 in _feed
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 865 in run
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 917 in _bootstrap_inner
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 885 in _bootstrap

Thread 0x00007fa7a24a1700 (most recent call first):
  File "/home/parietal/ogrisel/cpython/Lib/selectors.py", line 415 in select
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/connection.py", line 920 in wait
  File "/home/parietal/ogrisel/cpython/Lib/concurrent/futures/process.py", line 308 in _queue_management_worker
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 865 in run
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 917 in _bootstrap_inner
  File "/home/parietal/ogrisel/cpython/Lib/threading.py", line 885 in _bootstrap

Thread 0x00007fa7a748b700 (most recent call first):
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/synchronize.py", line 96 in __enter__
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/queues.py", line 357 in put
  File "/home/parietal/ogrisel/cpython/Lib/concurrent/futures/process.py", line 567 in shutdown
  File "/home/parietal/ogrisel/cpython/Lib/test/test_concurrent_futures.py", line 873 in test_crash
  File "/home/parietal/ogrisel/cpython/Lib/unittest/case.py", line 615 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/case.py", line 663 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/runner.py", line 176 in run
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 1892 in _run_suite
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 1936 in run_unittest
  File "/home/parietal/ogrisel/cpython/Lib/test/test_concurrent_futures.py", line 1185 in test_main
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 2068 in decorator
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/runtest.py", line 176 in runtest_inner
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/runtest.py", line 140 in runtest
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 379 in run_tests_sequential
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 458 in run_tests
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 536 in _main
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 510 in main
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 585 in main
  File "/home/parietal/ogrisel/cpython/Lib/test/__main__.py", line 2 in <module>
  File "/home/parietal/ogrisel/cpython/Lib/runpy.py", line 85 in _run_code
  File "/home/parietal/ogrisel/cpython/Lib/runpy.py", line 193 in _run_module_as_main


ogrisel commented Nov 3, 2017

Here is another run; this time the GC of the executor triggers the deadlock (via the weakref callback):

test_crash (test.test_concurrent_futures.ProcessPoolForkserverExecutorDeadlockTest) ... error at task pickle
exit at task unpickle
error at task unpickle
crash at task unpickle
crash during func execution on worker
exit during func execution on worker
error during func execution on worker
crash during result pickle on worker
exit during result pickle on worker
error during result pickle on worker
error during result unpickle in result_handler
Timeout (0:01:00)!
Thread 0x00007f8bf9956700 (most recent call first):
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/synchronize.py", line 96 in __enter__
  File "/home/parietal/ogrisel/cpython/Lib/multiprocessing/queues.py", line 357 in put
  File "/home/parietal/ogrisel/cpython/Lib/concurrent/futures/process.py", line 486 in weakref_cb
  File "/home/parietal/ogrisel/cpython/Lib/unittest/case.py", line 643 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/case.py", line 663 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 122 in run
  File "/home/parietal/ogrisel/cpython/Lib/unittest/suite.py", line 84 in __call__
  File "/home/parietal/ogrisel/cpython/Lib/unittest/runner.py", line 176 in run
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 1892 in _run_suite
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 1936 in run_unittest
  File "/home/parietal/ogrisel/cpython/Lib/test/test_concurrent_futures.py", line 1186 in test_main
  File "/home/parietal/ogrisel/cpython/Lib/test/support/__init__.py", line 2068 in decorator
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/runtest.py", line 176 in runtest_inner
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/runtest.py", line 140 in runtest
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 379 in run_tests_sequential
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 458 in run_tests
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 536 in _main
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 510 in main
  File "/home/parietal/ogrisel/cpython/Lib/test/libregrtest/main.py", line 585 in main
  File "/home/parietal/ogrisel/cpython/Lib/test/__main__.py", line 2 in <module>
  File "/home/parietal/ogrisel/cpython/Lib/runpy.py", line 85 in _run_code
  File "/home/parietal/ogrisel/cpython/Lib/runpy.py", line 193 in _run_module_as_main


ogrisel commented Nov 3, 2017

Note: I also had to use taskset -c 0 python -m test ... to trigger the race condition. The problem is that the lock of the result handler queue has not yet been released when the worker process is terminated.
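To see why a lock held by a dead worker is fatal, here is a standalone, POSIX-only sketch (not part of the PR) showing that a multiprocessing lock acquired by a SIGKILLed process is never released:

import multiprocessing as mp
import os
import signal

def hold_lock_and_die(lock):
    lock.acquire()                         # take the lock...
    os.kill(os.getpid(), signal.SIGKILL)   # ...and die without releasing it

if __name__ == "__main__":
    lock = mp.Lock()
    p = mp.Process(target=hold_lock_and_die, args=(lock,))
    p.start()
    p.join()
    # The dead process never released the lock, so anything that needs it next
    # (like the result-queue feeder) would block forever without a timeout.
    print("re-acquired:", lock.acquire(timeout=1))  # prints: re-acquired: False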


ogrisel commented Nov 7, 2017

@tomMoral this PR is now conflicting with master.

- TST crash in CallItem unpickling
- TST crash in func call run
- TST crash in result pickling
the tests include crashes with PythonError/SystemExit/SegFault

Also add more tests for race condition when a worker crashes

tomMoral commented Nov 9, 2017

This PR has been rebased on master and I think it is ready for review.

@@ -75,12 +76,36 @@
_threads_queues = weakref.WeakKeyDictionary()
_global_shutdown = False


# This constants control the maximal wakeup. If a job is submitted to the
Contributor

Typo: "This constant controls"

if reader in ready:
result_item = reader.recv()
# Wait for a result to be ready in the result_queue while checking
# that worker process are still running.
Contributor

typo: "that all worker processes are still running"

# result_queue state. This avoid deadlocks caused by the non
# transmission of wakeup signal when a worker died with the
# _result_queue write lock.
self._wakeup = _Sentinel()
Contributor

I think this would be better renamed to _queue_management_thread_sentinel to be more explicit.

self._queue_management_thread.daemon = True
self._queue_management_thread.start()
_threads_queues[self._queue_management_thread] = self._result_queue
_threads_queues[self._queue_management_thread] = self._wakeup
Contributor

_threads_queues should be renamed to _threads_sentinels.

p.terminate()
executor.shutdown(wait=True)
print(f"\nTraceback:\n {tb}", file=sys.__stderr__)
self.fail(f"Deadlock executor:\n\n{tb}")
Contributor

Phrasing: "Executor deadlock:"

from signal import SIGTERM as SIGKILL
try:
os.kill(pid, SIGKILL)
time.sleep(.01)
Contributor

Why is this sleep necessary? Shouldn't it be part of the caller instead? Or alternatively, kill several times until the OS replies that the process is dead:

n_trials = 3

for i in range(n_trials):
    try:
        os.kill(pid, SIGKILL)
    except (ProcessLookupError, PermissionError):
        break
    time.sleep(0.01)
else:
    raise RuntimeError(f"Could not kill process {pid} after {n_trials} trials")

Contributor Author

This sleep increases the chances for this process to be killed by another process and gives time to make sure that the BrokenProcessPool error occurs before the end of the current function.
I think that if a process fails to kill another one, it means the target has already shut down, so we should not retry multiple times.

tb = f.read()
for p in executor._processes.values():
p.terminate()
executor.shutdown(wait=True)
Member

If the executor is in a deadlock state, shutdown(wait=True) will never return, right?

Contributor Author

As we kill the processes just before, this should be safe. In this case, if the queue_management_worker is alive, it will flag the ProcessPoolExecutor as broken and clean up the state, and if it is not, the call to shutdown won't block and the resources will still be freed.

import faulthandler
from tempfile import TemporaryFile
with TemporaryFile(mode="w+") as f:
faulthandler.dump_traceback(file=f)
Member

Why not file=sys.stderr?

Contributor

@ogrisel ogrisel Dec 4, 2017

If I recall correctly, faulthandler actually needs a system-level file handle, not just a Python-level file object.

Contributor Author

This part of the code is called under test.support.captured_stderr. In that case, the sys.stderr object does not have a fileno and faulthandler.dump_traceback fails. So we resorted to this so that the traceback can be included in the exception. Another option is to dump the traceback to sys.stdout.
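For reference, a small standalone sketch (not taken from the test suite) of the constraint being discussed: faulthandler needs a real file descriptor, which the StringIO installed by captured_stderr cannot provide, hence the fallback to a temporary file.

import faulthandler
import io
import sys
from tempfile import TemporaryFile

sys.stderr = io.StringIO()  # roughly what test.support.captured_stderr does

try:
    faulthandler.dump_traceback(file=sys.stderr)
except (AttributeError, io.UnsupportedOperation):
    # StringIO has no usable fileno(), so dump to a real temporary file and
    # read the traceback back as a string.
    with TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f)
        f.seek(0)
        tb = f.read()
    print(tb)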

Member

@pitrou pitrou Dec 4, 2017

I see. Do you think this whole code is still useful? I understand it was useful for you to debug the PR, but now that it seems to pass, is there a reason for keeping it?

Contributor Author

The reason to keep it would be to have an informative test report in case a regression reintroduces a deadlock. As deadlocks can happen randomly, it is always good to have a traceback as soon as possible when one occurs, so I would rather keep it.
But we could move it to sys.stdout if you think the use of a tempfile introduces unnecessary complexity. The information would still be available, but in a degraded format: not in the test report itself, only in the test's stdout.


ogrisel commented Jan 4, 2018

@pitrou I think all the comments have been addressed in this PR.

Member

@pitrou pitrou left a comment

@ogrisel thanks for pinging me (did you amend previous commits? I didn't see any recent commits). There are still a couple of nits left.

@@ -94,6 +95,7 @@ def tearDown(self):


class ExecutorMixin:
timeout = 30
Member

This doesn't seem used anymore.

@@ -116,6 +118,8 @@ def setUp(self):
self._prime_executor()

def tearDown(self):
# Remove the reference to self.timer to avoid the thread_cleanup
Member

Also this comment doesn't seem necessary anymore. Or am I missing something?

call_queue.close()
# Make sure that the queue management thread was properly finished
# and the queue was closed by the shutdown process
queue_management_thread.join()
Member

It's already joined 6 lines above. The second join() call is a no-op, no?

@@ -759,6 +766,180 @@ def test_ressources_gced_in_workers(self):
ProcessPoolForkserverMixin,
ProcessPoolSpawnMixin))

def hide_process_stderr():
import io
setattr(sys, "stderr", io.StringIO())
Member

Is there a subtlety that I'm missing, or can this simply be written as sys.stderr = io.StringIO()?



class ExitAtPickle(object):
"""Bad object that triggers a segfault at pickling time."""
Member

The docstring is wrong here.



class ErrorAtPickle(object):
"""Bad object that triggers a segfault at pickling time."""
Member

And here as well.



class ErrorAtUnpickle(object):
"""Bad object that triggers a process exit at unpickling time."""
Member

And here too :-)

Contributor Author

The classic copy/paste mistake... Thanks for pointing it out! :-)

executor.shutdown(wait=True)

@classmethod
def _test_getpid(cls, a):
Member

This doesn't seem used anymore?



class ExecutorDeadlockTest:
# If ExecutorDeadlockTest takes more than 100secs to complete, it is very
Member

"100secs" isn't in sync with the number below.

Contributor Author

The comment was a duplicate of the one 3 lines below, so I removed it.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.


pitrou commented Jan 5, 2018

@tomMoral thanks for the updates! This is looking good to me now.

@pitrou pitrou merged commit 94459fd into python:master Jan 5, 2018

ogrisel commented Jan 5, 2018

Great, thanks for the review @pitrou!


adam-kral commented Dec 9, 2021

Hi, according to GitHub the fix commit has long been in Python 3.8, right?

However, it still hangs without an error, for example with the code you posted originally:

https://gist.github.com/tomMoral/cc27a938d669edcf0286c57516942369#file-fail_pickle_input_shutdown-py

Do I understand it correctly that it should result in an error now rather than just hang?


ogrisel commented Dec 9, 2021

> https://gist.github.com/tomMoral/cc27a938d669edcf0286c57516942369#file-fail_pickle_input_shutdown-py
> Do I understand it correctly that it should result in an error now rather than just hang?

I think this one should terminate without hanging and without errors (since we do not call f.result()). I cannot reproduce the hang with Python 3.9 on macOS. Which operating system do you use, @adam-kral?


adam-kral commented Dec 9, 2021

I use Ubuntu 20.04.3 LTS and Python 3.8.10. I tried Python 3.7.12 as well; it also hangs.

Traceback on KeyboardInterrupt
Traceback (most recent call last):
Process ForkProcess-1:
Process ForkProcess-3:
  File "/home/adam/.config/JetBrains/PyCharm2021.3/scratches/scratch_2.py", line 33, in <module>
Process ForkProcess-2:
    f = e.submit(id, ObjectWithPickleError())
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 644, in __exit__
    self.shutdown(wait=True)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 686, in shutdown
    self._queue_management_thread.join()
  File "/usr/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 97, in get
    res = self._recv_bytes()
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    elif lock.acquire(block, timeout):
KeyboardInterrupt
Process ForkProcess-4:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 96, in get
    with self._rlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt


adam-kral commented Dec 9, 2021

However, with Python 3.9.9, it exits as expected.

Shouldn't this mean the fix should be in Python 3.7 and newer?


ogrisel commented Dec 9, 2021

I can indeed reproduce the hang with Python 3.8. This is weird because this PR was merged as 94459fd, which is an ancestor of the v3.7.0 tag (and all subsequent Python releases). Maybe the shutdown case was only really fixed by a follow-up bugfix merged between 3.8 and 3.9.

@adam-kral

Yeah, thanks!

It's not the first time I've hit this bug. It's impossible to debug unless you google thoroughly.

I've since forgotten the specifics, but it has manifested again. I think another common case is when someone provides a function that is not picklable (as opposed to an argument): a lambda or a local function.

My usual workflow was:

from concurrent.futures import ProcessPoolExecutor

my_iterable = ...

with ProcessPoolExecutor() as executor:
    # Defined at module scope, so the function is picklable by reference
    # and can be sent to the worker processes.
    def process_item(item):
        ...
        return result
    results = executor.map(process_item, my_iterable)

# do something with results

That works well. However, I needed to move the code from the if __name__ == '__main__' block into a function because of setuptools, in order to define entry_points -> console_scripts.
In that moment, the function became local and unpicklable. All I got was this hang.
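For reference, the usual workaround (a sketch, not from this thread) is to keep the worker function at module level so it remains picklable by reference, even when the driver code lives in a main() entry point used by console_scripts:

from concurrent.futures import ProcessPoolExecutor

def process_item(item):
    # Module-level function: picklable by reference, so it can be sent to
    # the worker processes even when called from inside main().
    return item * 2

def main():
    my_iterable = range(10)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_item, my_iterable))
    print(results)

if __name__ == "__main__":
    main()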


ogrisel commented Dec 9, 2021

Alright, thanks for the details.

> In that moment, the function became local and unpicklable.

You might want to have a look at https://github.com/joblib/loky !
