bpo-30006 More robust concurrent.futures.ProcessPoolExecutor #1013


Closed
wants to merge 8 commits

Conversation

tomMoral
Contributor

@tomMoral tomMoral commented Apr 6, 2017

This set of patches intends to fix some silent deadlocks that can happen with
the present ProcessPoolExecutor implementation. The deadlock situations
notably include:

  • Pickling and unpickling errors in task definitions and results.
  • Processes that get killed in non-Pythonic ways (kill -9 / segfault).
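
As an illustration (not part of the patch), the first failure mode is easy to trigger: objects holding process-local state cannot cross the executor's call/result queues at all. A minimal sketch, with a helper name of our own choosing:

```python
import pickle
import threading

class Unpicklable:
    """Holds a lock which, like lambdas or open sockets, cannot be
    pickled and thus cannot cross a ProcessPoolExecutor queue."""
    def __init__(self):
        self.lock = threading.Lock()

def can_pickle(obj):
    """Return True if obj survives a round through pickle.dumps()."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
```

Before these fixes, an executor hitting such an object could hang silently instead of surfacing the error.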

The commit message of each patch gives more details, but in essence:

  1. Support for a multiprocessing context instead of forcing the use of fork
    on POSIX systems.
  2. Avoid a deadlock when a process dies with the result_queue.wlock acquired.
  3. Deadlock-free cleanup of failures in the queue_manager_thread and the
    queue_feeder_thread.
  4. Deadlock-free termination of workers and threads when the
    ProcessPoolExecutor instance has been gc'ed.
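
Point 1 is the user-visible part: since this work landed (Python 3.7), the start method can be chosen explicitly. A minimal sketch, assuming a POSIX platform where the fork start method is available:

```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

def double(x):
    return 2 * x

def run_with_context(method="fork"):
    # ProcessPoolExecutor grew an mp_context argument in Python 3.7;
    # before that, the executor always used fork on POSIX.
    ctx = mp.get_context(method)
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        return list(executor.map(double, [1, 2, 3]))
```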

The commits are incremental; each one adds new fixes on top of the previous
ones to handle extra deadlock situations. If you think that some changes in the
last commits need more discussion or refinement, let me know so I can separate
them and open different tickets/PRs.

Each commit passes the test suite on its own, but testing all three contexts
makes test_concurrent_futures longer (~140s on my computer). One fix would be
to split the tests for each context into separate files (allowing test
parallelization). I did not do it as it changes the structure of the test
suite, but I can implement it if necessary.

This work was done as part of the loky project in collaboration with
@ogrisel. It provides a backport of the same features for older versions of
Python, including 2.7 for legacy users, as well as reusable executors.

https://bugs.python.org/issue30006

@mention-bot

@tomMoral, thanks for your PR! By analyzing the history of the files in this pull request, we identified @brianquinlan, @asvetlov and @shibturn to be potential reviewers.

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again for your contribution and we look forward to looking at it!

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 47a1046 to 389eb02 on April 6, 2017 12:58
@tomMoral tomMoral changed the title More robust concurrent.futures.ProcessPoolExecutor bpo-30006 More robust concurrent.futures.ProcessPoolExecutor Apr 6, 2017
@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 874f50e to 9859dc5 on April 7, 2017 12:34
@tomMoral tomMoral force-pushed the PR_robust_executor branch from 9859dc5 to f8e81e0 on April 7, 2017 12:52
@ogrisel
Contributor

ogrisel commented Apr 7, 2017

I have a hard time understanding the codecov report on this PR, but it seems that, the way tests are currently run on Travis, it is not possible to collect coverage data in the parts of the code that are only executed in multiprocessing worker processes.

For loky we had to use advanced coverage configuration to collect such data, see for instance:

Please let us know if you want us to adapt such a configuration to better collect coverage data on child-process code in the cpython project.

@tomMoral tomMoral force-pushed the PR_robust_executor branch 10 times, most recently from 7f67dfd to e8ce1b6 on April 13, 2017 18:13
@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 3bd48fe to c15c8d6 on May 12, 2017 13:44
@tomMoral tomMoral force-pushed the PR_robust_executor branch from c15c8d6 to 1f3bbc5 on June 2, 2017 14:13
@tomMoral
Contributor Author

tomMoral commented Jun 2, 2017

@pitrou Rebasing this PR on top of the current master broke our deadlock detection test on unpicklable tasks.
See bpo-30414

@tomMoral tomMoral force-pushed the PR_robust_executor branch 2 times, most recently from 79ce609 to 495a2d2 on June 12, 2017 13:24
@tomMoral
Contributor Author

tomMoral commented Jun 13, 2017

I spotted a design mistake, and I forgot to add tests for the Queue.
I am on it right now.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 495a2d2 to 5df6c68 on June 13, 2017 15:51
@tomMoral tomMoral force-pushed the PR_robust_executor branch from 5df6c68 to d988a24 on June 13, 2017 16:08

class NotSerializable(object):
    def __init__(self):
        self.pass_test = True
Contributor

This should be set to False by default, otherwise the test is too easy.

Contributor Author

If I set it to False, the test will always fail, as I use &= afterward.

Contributor

OK, then to make the test more readable we should have two flags:

  • one named reduce_was_called, set to False in __init__ and subsequently set to True in __reduce__
  • another named on_queue_feeder_error_was_called, set to False in __init__ and subsequently set to True in _on_queue_feeder_error by the Queue itself, if e and obj have the expected types.

And reverse the order of q.put(unserializable_obj) and q.put(True), so as to make sure that we can call self.assertTrue on the flags after self.assertTrue(q.get(timeout=0.1)) has succeeded.
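
The flag idea can be sketched standalone, outside the executor machinery, using plain pickle to trigger __reduce__ (class and helper names here are illustrative, not the PR's final test code):

```python
import pickle

class NotSerializable:
    """Pickling always fails; a flag records that __reduce__ ran."""
    def __init__(self):
        self.reduce_was_called = False

    def __reduce__(self):
        self.reduce_was_called = True
        raise pickle.PicklingError("intentionally not serializable")

def try_pickle(obj):
    """Return True if pickling succeeds, False on a PicklingError."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False
```

The test can then assert on reduce_was_called after the failed put, rather than relying on a single pass_test flag combined with &=.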

Contributor Author

OK, I did it, and I also updated the timeout to match the new timeout from #2148.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from d988a24 to 6a77782 on June 14, 2017 16:22
@tomMoral
Contributor Author

I updated this PR in the last few days:

  • I adapted the code base to deal with the unfailing Queue (see bpo-30414).
  • I removed the coverage update to avoid adding unrelated changes.

@tomMoral tomMoral force-pushed the PR_robust_executor branch from 6a77782 to 95b1957 on June 26, 2017 10:47
tomMoral/loky#48
* Add a context argument to allow a non-forking ProcessPoolExecutor
* Do some cleanup (PEP 8 + unused code + naming)
* Release the resources earlier in `_worker_process`
tomMoral/loky#48
This avoids deadlocks if a Process dies while:
* Unpickling the _CallItem
* Pickling a _ResultItem

Wakeups are done with a _Sentinel object that cannot be used with wait.
We do not use a Connection/Queue, as it brings a lot of overhead to the
Executor while we only use a small part of it. We might want to implement a
Sentinel object that can be waited upon, to simplify the code and make it
more robust.
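
A waitable wakeup object of the kind suggested above can be sketched with a multiprocessing Pipe, whose reader end is accepted by multiprocessing.connection.wait alongside process sentinels. The class below is hypothetical, not the PR's implementation:

```python
from multiprocessing import Pipe
from multiprocessing.connection import wait

class WaitableSentinel:
    """Wakeup object whose reader end can be passed to
    multiprocessing.connection.wait() together with other sentinels."""
    def __init__(self):
        # one-way pipe: signalling writes a byte, waiting polls the reader
        self._reader, self._writer = Pipe(duplex=False)

    def set(self):
        self._writer.send_bytes(b"x")

    def clear(self):
        # drain any pending wakeup bytes
        while self._reader.poll():
            self._reader.recv_bytes()

    @property
    def sentinel(self):
        return self._reader
```

Waiting on such an object avoids the busy polling that a bare, non-waitable _Sentinel would require.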

Test that there is no deadlock with crashes:
* TST crash in _CallItem unpickling
* TST crash in the function call run
* TST crash in _ResultItem pickling
The tests include crashes with PythonError/SystemExit/SegFault.
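
The "killed in a non-Pythonic way" case from these tests can be simulated outside the executor; a minimal sketch, assuming a POSIX platform with the fork start method (helper names are ours):

```python
import os
import signal
import multiprocessing as mp

def crash():
    # Simulate a worker dying non-Pythonically: SIGKILL cannot be
    # caught, so no Python-level cleanup runs in the child.
    os.kill(os.getpid(), signal.SIGKILL)

def run_and_get_exitcode(target):
    # A negative exitcode means the process was killed by that signal.
    p = mp.get_context("fork").Process(target=target)
    p.start()
    p.join()
    return p.exitcode
```

In the buggy executor, such a death while a queue lock was held is what left the manager thread waiting forever.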
tomMoral/loky#48
This extra thread checks that the _queue_manager_thread is alive and working.
If not, it makes it possible to avoid deadlocks and raise an appropriate
error. It also checks that the QueueFeederThread is alive.
Add an _ExecutorFlags object that holds the state of the
ProcessPoolExecutor. This permits introspecting the executor state even
after it has been gc'ed and allows handling the errors correctly. It also
introduces a ShutdownExecutorError for jobs that were cancelled on shutdown.

Also, this changes the `for` loop on `processes` to a `while` loop to
avoid concurrent dictionary update errors.
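
The for-to-while change guards against mutating the processes dict while iterating over it; a minimal sketch of the pattern (names are illustrative):

```python
def shutdown_all(procs):
    """Drain a {pid: process} dict without iterating over it.

    Removing entries inside `for pid in procs:` raises
    "RuntimeError: dictionary changed size during iteration";
    draining with a while loop stays safe even if another thread
    adds or removes entries between iterations.
    """
    stopped = []
    while procs:
        pid, _ = procs.popitem()
        stopped.append(pid)
    return stopped
```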
@tomMoral
Contributor Author

tomMoral commented Sep 8, 2017

This PR still requires some small fixes (particularly to avoid multiple test runs in the Travis GCC builds).
But feel free to have a look at the diff if you want to start a discussion.

@pitrou
Member

pitrou commented Sep 20, 2017

@tomMoral, do you think it would be possible to split this PR into several ones based on the different issues being fixed (or improvements being made)? I think I'll have a hard time reviewing the PR as is, given how delicate multiprocessing (or concurrent.futures) code generally is.

@tomMoral
Contributor Author

@pitrou Yes, no problem. I created PR #3682 for the first part, which allows passing a context to the ProcessPoolExecutor constructor. I will create other PRs for each part when I get a bit more time, but we can start with this one.

@tomMoral
Contributor Author

tomMoral commented Oct 5, 2017

@pitrou I created a second PR, #3895, for the third commit (I realised the second commit was not necessary). It should fix the interpreter freeze caused by pickling/unpickling errors.

@ogrisel
Contributor

ogrisel commented Nov 27, 2017

I think we can close this PR in favor of #3895 and #4256.

@pitrou
Member

pitrou commented Nov 27, 2017

Ok, closing.

@pitrou pitrou closed this Nov 27, 2017
6 participants