bpo-31699 Deadlocks in concurrent.futures.ProcessPoolExecutor with pickling error #3895
Conversation
Force-pushed from 5865996 to 69d774f
At a high level, I'm a bit surprised by the strategy adopted by this PR. Isn't it possible to keep a functioning executor in case an error occurs when (un)pickling? What does …? The crashing process problem is separate IMHO (and arguably less likely), though it does deserve to be handled.
For unpickling errors, you currently have no way to get the … For pickling errors, this is much simpler, and I should add something for the result pickling error to make sure we do not fail the executor in this case. The pickling error for the call item is already handled that way.
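To make that idea concrete, here is a rough sketch (not the actual Lib/concurrent/futures/process.py code; _send_result is a made-up helper name for illustration) of how a worker can convert a result-pickling failure into an exception on the corresponding future instead of breaking the whole executor:

import pickle
import traceback


def _send_result(result_queue, work_id, result):
    # Probe that the result survives pickling; if not, ship back a picklable
    # error so the parent can set it on the matching future while the
    # executor itself keeps working.
    try:
        pickle.dumps(result)
    except Exception:
        exc = RuntimeError(
            "Task result could not be pickled:\n" + traceback.format_exc())
        result_queue.put((work_id, None, exc))
    else:
        result_queue.put((work_id, result, None))


if __name__ == "__main__":
    import queue
    q = queue.Queue()
    _send_result(q, 0, lambda x: x)  # lambdas cannot be pickled
    print(q.get())                   # -> (0, None, RuntimeError(...))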
After thinking a bit, my main concern about safe pickling/unpickling of …
Force-pushed from 8427d36 to c70c4d6
@tomMoral can you please summarize the change you made in the last commit? The tests deadlock now. I think this is no longer robust to the …
Force-pushed from c70c4d6 to 3c74cb2
@ogrisel In the last commit, I just removed the crash detection thread and the cases in the test suite that are no longer handled. It works locally, but it seems to fail deterministically on Travis with …
Force-pushed from e8595c5 to 7ad6721
I managed to reproduce the deadlock (without -j4 and with --timeout=300):
Here is another run, this time the GC of the executor triggers the deadlock (via the weakref callback):
Force-pushed from 7ad6721 to 7d886ad
Note: I also had to use …
Force-pushed from b37efc5 to 3de1ee7
@tomMoral this PR is now conflicting with master.
- TST crash in CallItem unpickling
- TST crash in func call run
- TST crash in result pickling
The tests include crashes with PythonError/SystemExit/SegFault.
Also add more tests for race conditions when a worker crashes.
Force-pushed from 3de1ee7 to 04e1dc1
This PR has been rebased on …
Lib/concurrent/futures/process.py
Outdated
@@ -75,12 +76,36 @@
_threads_queues = weakref.WeakKeyDictionary()
_global_shutdown = False


# This constants control the maximal wakeup. If a job is submitted to the
Typo: "This constant controls"
Lib/concurrent/futures/process.py
Outdated
if reader in ready:
    result_item = reader.recv()
# Wait for a result to be ready in the result_queue while checking
# that worker process are still running.
typo: "that all worker processes are still running"
Lib/concurrent/futures/process.py
Outdated
# result_queue state. This avoid deadlocks caused by the non
# transmission of wakeup signal when a worker died with the
# _result_queue write lock.
self._wakeup = _Sentinel()
I think this would be better renamed to _queue_management_thread_sentinel to be more explicit.
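For context, the wakeup/sentinel mechanism under discussion can be pictured roughly as follows (a simplified sketch, not the actual process.py implementation; the wakeup channel is modelled here with a pipe and manager_loop is a made-up name). The queue management thread waits on the result pipe, a dedicated wakeup pipe and the worker sentinels, so it can still be woken up if a worker dies while holding the result queue's write lock:

from multiprocessing.connection import wait


def manager_loop(result_reader, wakeup_reader, processes):
    # result_reader: Connection carrying result items from the workers
    # wakeup_reader: Connection used only to wake this thread up explicitly
    # processes: list of multiprocessing.Process objects (the workers)
    while True:
        worker_sentinels = [p.sentinel for p in processes]
        ready = wait([result_reader, wakeup_reader] + worker_sentinels)
        if result_reader in ready:
            result_item = result_reader.recv()
            # ... dispatch the result to the matching future ...
        elif wakeup_reader in ready:
            wakeup_reader.recv()
            # ... woken up explicitly, e.g. for shutdown or a new job ...
        else:
            # A worker died unexpectedly: mark the executor as broken
            # instead of blocking forever on the result queue.
            break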
Lib/concurrent/futures/process.py
Outdated
self._queue_management_thread.daemon = True
self._queue_management_thread.start()
_threads_queues[self._queue_management_thread] = self._result_queue
_threads_queues[self._queue_management_thread] = self._wakeup
_threads_queues should be renamed to _threads_sentinels.
Lib/test/test_concurrent_futures.py
Outdated
    p.terminate()
executor.shutdown(wait=True)
print(f"\nTraceback:\n {tb}", file=sys.__stderr__)
self.fail(f"Deadlock executor:\n\n{tb}")
Phrasing: "Executor deadlock:"
Lib/test/test_concurrent_futures.py
Outdated
from signal import SIGTERM as SIGKILL
try:
    os.kill(pid, SIGKILL)
    time.sleep(.01)
Why is this sleep necessary? Shouldn't it be part of the caller instead? Or alternatively, kill several times until the OS replies that the process is dead:

n_trials = 3
for i in range(n_trials):
    try:
        os.kill(pid, SIGKILL)
    except (ProcessLookupError, PermissionError):
        break
    time.sleep(0.01)
else:
    raise RuntimeError(f"Could not kill process {pid} after {n_trials} trials")
This sleep increases the chances for this process to be killed by another process and gives time to make sure that the BrokenProcessPool error occurs before the end of the current function.
I think that if a process fails to kill another one, it means that it was already shut down, so we should not make multiple tries.
Lib/test/test_concurrent_futures.py
Outdated
tb = f.read()
for p in executor._processes.values():
    p.terminate()
executor.shutdown(wait=True)
If the executor is in a deadlock state, shutdown(wait=True) will never return, right?
As we kill the processes just before, this should be safe. In this case, if the queue_management_worker is alive, it will flag the ProcessPoolExecutor as broken and clean up the state; and if it is not, the call to shutdown won't block and the resources will still be freed.
Lib/test/test_concurrent_futures.py
Outdated
import faulthandler
from tempfile import TemporaryFile
with TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
Why not file=sys.stderr?
If I recall correctly, faulthandler actually needs a system-level file handle, not just a Python-level file object.
This part of the code is called with test.support.captured_stderr. In this case, the sys.stderr object does not have a fileno and faulthandler.dump_traceback fails. So we resorted to this to allow getting the traceback in the exception. Another option is to dump the traceback on sys.stdout.
I see. Do you think this whole code is still useful? I understand it was useful for you to debug the PR, but now that it seems to pass, is there a reason for keeping it?
The reason to keep it would be to have an informative test report in case of a regression introducing a deadlock. As deadlocks potentially happen randomly, it is always good to have a traceback as soon as possible when it happens. So I would rather keep it.
But we could move it to sys.stdout if you think the usage of a tempfile introduces unnecessary complexity. The test report would still have the information, but in a degraded format: not in the test report, but in the test stdout.
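For illustration, the tempfile-based approach being discussed boils down to something like this (assumed usage, not the exact test code; all_thread_tracebacks is a made-up helper name):

import faulthandler
from tempfile import TemporaryFile


def all_thread_tracebacks():
    # faulthandler needs a file with a real fileno(), which an io.StringIO
    # stand-in for sys.stderr does not have, hence the temporary file.
    with TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    # The captured text can then be embedded in a test failure message.
    print(all_thread_tracebacks())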
Force-pushed from 5da1f92 to 874f263
@pitrou I think all the comments have been addressed in this PR.
@ogrisel thanks for pinging me (did you amend previous commits? I didn't see any recent commits). There are still a couple nits left.
Lib/test/test_concurrent_futures.py
Outdated
@@ -94,6 +95,7 @@ def tearDown(self):


class ExecutorMixin:
    timeout = 30
This doesn't seem used anymore.
Lib/test/test_concurrent_futures.py
Outdated
@@ -116,6 +118,8 @@ def setUp(self):
    self._prime_executor()

def tearDown(self):
    # Remove the reference to self.timer to avoid the thread_cleanup
Also this comment doesn't seem necessary anymore. Or am I missing something?
Lib/test/test_concurrent_futures.py
Outdated
call_queue.close()
# Make sure that the queue management thread was properly finished
# and the queue was closed by the shutdown process
queue_management_thread.join()
It's already joined 6 lines above. The second join() call is a no-op, no?
Lib/test/test_concurrent_futures.py
Outdated
@@ -759,6 +766,180 @@ def test_ressources_gced_in_workers(self):
    ProcessPoolForkserverMixin,
    ProcessPoolSpawnMixin))


def hide_process_stderr():
    import io
    setattr(sys, "stderr", io.StringIO())
Is there a subtlety that I'm missing, or can this simply be written as sys.stderr = io.StringIO()?
Lib/test/test_concurrent_futures.py
Outdated
class ExitAtPickle(object):
    """Bad object that triggers a segfault at pickling time."""
The docstring is wrong here.
Lib/test/test_concurrent_futures.py
Outdated
class ErrorAtPickle(object):
    """Bad object that triggers a segfault at pickling time."""
And here as well.
Lib/test/test_concurrent_futures.py
Outdated
class ErrorAtUnpickle(object):
    """Bad object that triggers a process exit at unpickling time."""
And here too :-)
The classic copy/paste mistake... Thanks for pointing it out! :-)
Lib/test/test_concurrent_futures.py
Outdated
executor.shutdown(wait=True)

@classmethod
def _test_getpid(cls, a):
This doesn't seem used anymore?
Lib/test/test_concurrent_futures.py
Outdated
class ExecutorDeadlockTest:
    # If ExecutorDeadlockTest takes more than 100secs to complete, it is very
"100secs" isn't in sync with the number below.
The comment was a duplicate from the one 3 lines below so I removed it.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests, along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase …
@tomMoral thanks for the updates! This is looking good to me now.
Great, thanks for the review @pitrou!
Hi, according to GitHub the fix commit has long been in Python 3.8, right? However, it still hangs without an error, for example with the code you posted originally: https://gist.github.com/tomMoral/cc27a938d669edcf0286c57516942369#file-fail_pickle_input_shutdown-py Do I understand correctly that it should now result in an error rather than just hang?
I think this one should terminate without hanging and without errors (since we do not call …
I use Ubuntu 20.04.3 LTS and Python 3.8.10. I also tried Python 3.7.12; it hangs as well. Traceback on KeyboardInterrupt: …
I can indeed reproduce the hanging with Python 3.8. This is weird because this PR was merged as 94459fd, which is an ancestor commit of the v3.7.0 tag (and all subsequent Python releases). Maybe the shutdown case was fixed for real only in a follow-up bugfix merged between 3.8 and 3.9.
Yeah, thanks! It's not the first time I have hit this bug. It is impossible to debug unless you google thoroughly. Since then, I forgot about the specifics, but it has manifested again. Basically, I think another common case is that someone provides a function that is not picklable (as opposed to an argument): a lambda or a local function. My usual workflow was:

my_iterable = ...

with ProcessPoolExecutor() as executor:
    def process_item(item):
        ...
        return result

    results = executor.map(process_item, my_iterable)
    # do sth with results

That works well. However, I needed to move the code from …
Alright, thanks for the details.
You might want to have a look at https://github.com/joblib/loky!
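For readers hitting the same issue: the usual workaround for the scenario described above is to define the worker function at module level so it can be pickled by reference (a minimal sketch; process_item and my_iterable are placeholder names, not from the original code):

from concurrent.futures import ProcessPoolExecutor


def process_item(item):
    # Module-level functions are picklable by reference, so they can be
    # shipped to the worker processes by ProcessPoolExecutor.
    return item * 2


def main():
    my_iterable = range(10)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_item, my_iterable))
    print(results)


if __name__ == "__main__":
    main()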
When using concurrent.futures.ProcessPoolExecutor with objects that are not picklable or unpicklable, several situations result in a deadlock, with the interpreter frozen. This is the case for different scenarios, for instance these three scripts. This PR proposes to test the different failure scenarios and fix the ones that result in deadlocks.
Overall, the goal is to make concurrent.futures.ProcessPoolExecutor more robust to faulty user code. This work was done as part of tommoral/loky#48, with the intent to re-use the executor in multiple independent parts of the program, in collaboration with @ogrisel. See #1013 for more details.
https://bugs.python.org/issue31699
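As an illustration of the kind of failure scenario targeted by this PR (a reduced sketch, not one of the exact linked scripts): submitting an argument that fails to pickle used to be able to deadlock the executor; the goal of the fix is to surface the error on the future or mark the pool as broken instead, although, as noted above in the conversation, some shutdown paths may still hang on certain Python versions.

import pickle
from concurrent.futures import ProcessPoolExecutor


class ErrorAtPickle:
    """Object that raises when pickled, mimicking a faulty user argument."""
    def __reduce__(self):
        raise pickle.PicklingError("cannot pickle this object")


def identity(x):
    return x


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        future = executor.submit(identity, ErrorAtPickle())
        try:
            future.result(timeout=30)
        except Exception as exc:
            # With the fix, the pickling error is reported here instead of
            # the interpreter hanging forever.
            print(f"future failed with: {exc!r}")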