bpo-30006 More robust concurrent.futures.ProcessPoolExecutor #1013


Closed
wants to merge 8 commits

Conversation

tomMoral
Contributor

@tomMoral tomMoral commented Apr 6, 2017

This set of patches intends to fix some silent deadlocks that can happen with
the present ProcessPoolExecutor implementation. The deadlock situations
notably include:

  • Pickling and unpickling errors in task definitions and results.
  • Processes that get killed in non-Pythonic ways (kill -9 / segfault).
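
As an illustration (not part of the patch), the first failure mode is easy to trigger: objects holding process-local state cannot cross the executor's call/result queues at all. A minimal sketch, with a helper name of our own choosing:

```python
import pickle
import threading

class Unpicklable:
    """Holds a lock which, like lambdas or open sockets, cannot be
    pickled and thus cannot cross a ProcessPoolExecutor queue."""
    def __init__(self):
        self.lock = threading.Lock()

def can_pickle(obj):
    """Return True if obj survives a round through pickle.dumps()."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

Before these fixes, an executor hitting such an object could hang silently instead of surfacing the error.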

The commit message of each patch gives more details, but in essence:

  1. Support for a multiprocessing context instead of forcing the use of fork
    on POSIX systems.
  2. Avoid a deadlock when a process dies with the result_queue.wlock acquired.
  3. Deadlock-free cleanup of failures in the queue_manager_thread and the
    queue_feeder_thread.
  4. Deadlock-free termination of workers and threads when the
    ProcessPoolExecutor instance has been gc'ed.
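
Point 1 is the user-visible part: since this work landed (Python 3.7), the start method can be chosen explicitly. A minimal sketch, assuming a POSIX platform where the fork start method is available:

```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

def double(x):
    return 2 * x

def run_with_context(method="fork"):
    # ProcessPoolExecutor grew an mp_context argument in Python 3.7;
    # before that, the executor always used fork on POSIX.
    ctx = mp.get_context(method)
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        return list(executor.map(double, [1, 2, 3]))
```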

The commits are incremental; each one adds new fixes on top of the previous
ones to handle extra deadlock situations. If you think that some changes in the
last commits need more discussion or refinement, let me know so I can separate
them and open different tickets/PRs.

Each commit passes the test suite on its own, but testing all three contexts
makes test_concurrent_futures longer (~140s on my computer). One fix would be
to split the tests for each context into separate files (allowing test
parallelization). I did not do it as it changes the structure of the test
suite, but I can implement it if necessary.

This work was done as part of the loky project in collaboration with
@ogrisel. It provides a backport of the same features for older versions of
Python, including 2.7 for legacy users, as well as reusable executors.

https://bugs.python.org/issue30006

@mention-bot

@tomMoral, thanks for your PR! By analyzing the history of the files in this pull request, we identified @brianquinlan, @asvetlov and @shibturn to be potential reviewers.

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again for your contribution and we look forward to looking at it!

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 47a1046 to 389eb02 on April 6, 2017 12:58
@tomMoral tomMoral changed the title More robust concurrent.futures.ProcessPoolExecutor bpo-30006 More robust concurrent.futures.ProcessPoolExecutor Apr 6, 2017
@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 874f50e to 9859dc5 on April 7, 2017 12:34
@tomMoral tomMoral force-pushed the PR_robust_executor branch from 9859dc5 to f8e81e0 on April 7, 2017 12:52
@ogrisel
Contributor

ogrisel commented Apr 7, 2017

I have a hard time understanding the codecov report on this PR, but it seems that, the way tests are currently run on Travis, it is not possible to collect coverage data in the parts of the code that are only executed in multiprocessing worker processes.

For loky we had to use advanced coverage configuration to collect such data, see for instance:

Please let us know if you want us to adapt such a configuration to better collect coverage data on child-process code in the cpython project.

@tomMoral tomMoral force-pushed the PR_robust_executor branch 10 times, most recently from 7f67dfd to e8ce1b6 on April 13, 2017 18:13
@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 3bd48fe to c15c8d6 on May 12, 2017 13:44
@tomMoral tomMoral force-pushed the PR_robust_executor branch from c15c8d6 to 1f3bbc5 on June 2, 2017 14:13
@tomMoral
Contributor Author

tomMoral commented Jun 2, 2017

@pitrou Rebasing this PR on top of the current master broke our deadlock detection test on unpicklable tasks.
See bpo-30414

@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 79ce609 to 495a2d2 on June 12, 2017 13:24
@tomMoral
Contributor Author

tomMoral commented Jun 13, 2017

I spotted a design mistake, and I forgot to add tests for the Queue.
I am on it right now.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 495a2d2 to 5df6c68 on June 13, 2017 15:51
@tomMoral tomMoral force-pushed the PR_robust_executor branch from 5df6c68 to d988a24 on June 13, 2017 16:08

class NotSerializable(object):
    def __init__(self):
        self.pass_test = True
Contributor

This should be set to False by default, otherwise the test is too easy.

Contributor Author

If I set it to False, the test will always fail, as I use &= afterward.

Contributor

OK, then to make the test more readable we should have two flags:

  • one named reduce_was_called, set to False in __init__ and subsequently set to True in __reduce__
  • another named on_queue_feeder_error_was_called, set to False in __init__ and subsequently set to True in _on_queue_feeder_error by the Queue itself, if e and obj have the expected types.

And reverse the order of q.put(unserializable_obj) and q.put(True), so as to make sure that we can call self.assertTrue on the flags after self.assertTrue(q.get(timeout=0.1)) has succeeded.
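
The flag idea can be sketched standalone, outside the executor machinery, using plain pickle to trigger __reduce__ (class and helper names here are illustrative, not the PR's final test code):

```python
import pickle

class NotSerializable:
    """Pickling always fails; a flag records that __reduce__ ran."""
    def __init__(self):
        self.reduce_was_called = False

    def __reduce__(self):
        self.reduce_was_called = True
        raise pickle.PicklingError("intentionally not serializable")

def try_pickle(obj):
    """Return True if pickling succeeds, False on a PicklingError."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False
```

The test can then assert on reduce_was_called after the failed put, rather than relying on a single pass_test flag combined with &=.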

Contributor Author

OK, I did it, and I also updated the timeout to match the new timeout from #2148.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from d988a24 to 6a77782 on June 14, 2017 16:22
@tomMoral
Contributor Author

I updated this PR in the last few days:

  • I adapted the code base to deal with the unfailing Queue (see bpo-30414).
  • I removed the coverage update to avoid adding unrelated changes.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 6a77782 to 95b1957 on June 26, 2017 10:47
tomMoral/loky#48
* Add a context argument to allow a non-forking ProcessPoolExecutor
* Do some cleanup (PEP 8 + unused code + naming)
* Release the resources earlier in `_worker_process`
tomMoral/loky#48
This avoids deadlocks if a Process dies while:
* Unpickling the _CallItem
* Pickling a _ResultItem

Wakeups are done with a _Sentinel object that cannot be used with wait.
We do not use a Connection/Queue, as it brings a lot of overhead to the
Executor while we only use a small part of it. We might want to implement a
Sentinel object that can be waited upon, to simplify the code and make it
more robust.
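
A waitable wakeup object of the kind suggested above can be sketched with a multiprocessing Pipe, whose reader end is accepted by multiprocessing.connection.wait alongside process sentinels. The class below is hypothetical, not the PR's implementation:

```python
from multiprocessing import Pipe
from multiprocessing.connection import wait

class WaitableSentinel:
    """Wakeup object whose reader end can be passed to
    multiprocessing.connection.wait() together with other sentinels."""
    def __init__(self):
        # one-way pipe: signalling writes a byte, waiting polls the reader
        self._reader, self._writer = Pipe(duplex=False)

    def set(self):
        self._writer.send_bytes(b"x")

    def clear(self):
        # drain any pending wakeup bytes
        while self._reader.poll():
            self._reader.recv_bytes()

    @property
    def sentinel(self):
        return self._reader
```

Waiting on such an object avoids the busy polling that a bare, non-waitable _Sentinel would require.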

Test that there is no deadlock with crashes:
* TST crash in _CallItem unpickling
* TST crash in the function call run
* TST crash in _ResultItem pickling
The tests include crashes with PythonError/SystemExit/SegFault.
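
The "killed in a non-Pythonic way" case from these tests can be simulated outside the executor; a minimal sketch, assuming a POSIX platform with the fork start method (helper names are ours):

```python
import os
import signal
import multiprocessing as mp

def crash():
    # Simulate a worker dying non-Pythonically: SIGKILL cannot be
    # caught, so no Python-level cleanup runs in the child.
    os.kill(os.getpid(), signal.SIGKILL)

def run_and_get_exitcode(target):
    # A negative exitcode means the process was killed by that signal.
    p = mp.get_context("fork").Process(target=target)
    p.start()
    p.join()
    return p.exitcode
```

In the buggy executor, such a death while a queue lock was held is what left the manager thread waiting forever.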
tomMoral/loky#48
This extra thread checks that the _queue_manager_thread is alive and working.
If not, it makes it possible to avoid deadlocks and raise an appropriate
error. It also checks that the QueueFeederThread is alive.
Add an _ExecutorFlags object that holds the state of the
ProcessPoolExecutor. This permits introspecting the executor state even
after it has been gc'ed and allows handling the errors correctly. It also
introduces a ShutdownExecutorError for jobs that were cancelled on shutdown.

Also, this changes the `for` loop on `processes` to a `while` loop to
avoid concurrent dictionary update errors.
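
The for-to-while change guards against mutating the processes dict while iterating over it; a minimal sketch of the pattern (names are illustrative):

```python
def shutdown_all(procs):
    """Drain a {pid: process} dict without iterating over it.

    Removing entries inside `for pid in procs:` raises
    "RuntimeError: dictionary changed size during iteration";
    draining with a while loop stays safe even if another thread
    adds or removes entries between iterations.
    """
    stopped = []
    while procs:
        pid, _ = procs.popitem()
        stopped.append(pid)
    return stopped
```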
@tomMoral
Contributor Author

tomMoral commented Sep 8, 2017

This PR still requires some small fixes (particularly to avoid multiple test runs in the Travis GCC builds).
But feel free to have a look at the diff if you want to start a discussion.

@pitrou
Member

pitrou commented Sep 20, 2017

@tomMoral, do you think it would be possible to split this PR into several ones based on the different issues being fixed (or improvements being made)? I think I'll have a hard time reviewing the PR as is, given how delicate multiprocessing (or concurrent.futures) code generally is.

@tomMoral
Contributor Author

@pitrou Yes, no problem. I created PR #3682 for the first part, which allows passing a context to the ProcessPoolExecutor constructor. I will create other PRs for each part when I get a bit more time, but we can start with this one.

@tomMoral
Contributor Author

tomMoral commented Oct 5, 2017

@pitrou I created a second PR, #3895, for the third commit (I realised the second commit was not necessary). It should fix the interpreter freeze caused by pickling/unpickling errors.

@ogrisel
Contributor

ogrisel commented Nov 27, 2017

I think we can close this PR in favor of #3895 and #4256.

@pitrou
Member

pitrou commented Nov 27, 2017

Ok, closing.

@pitrou pitrou closed this Nov 27, 2017
6 participants