Skip to content

[monarch] Fix TLS behavior for async endpoints too #268

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

suo
Copy link
Contributor

@suo suo commented Jun 14, 2025

Stack from ghstack (oldest at bottom):

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:

  • We should have two flavors of execution. If there is any async endpoint in the actor, we create an _AsyncActor, otherwise we do a _SyncActor.
  • For an _AsyncActor: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every _AsyncActor will get its own thread to run code in.
  • For a _SyncActor: we schedule all Python code in-line on the handler thread. Each _SyncActor also gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:

  • The control flow is simpler to understand—handle and handle_def are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
  • The error behavior is more correct. Previously, an uncaught error in a background task would be silently dropped, since we never directly awaited the _complete() task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 14, 2025
I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

ghstack-source-id: 290472305
Pull Request resolved: #268
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 14, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 14, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290472425
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 14, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290473421
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 15, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290573928
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 16, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290829977
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 17, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290883431
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff implements the proposed fix:
- We should have two flavors of execution. If there is any async endpoint in the actor, we create an `_AsyncActor`, otherwise we do a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. We get rid of the current behavior where sometimes we run some stuff on the handler thread, sometimes we do it on the event loop. Every `_AsyncActor` will get its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code in-line on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)

[ghstack-poisoned]
suo added a commit that referenced this pull request Jun 17, 2025
Pull Request resolved: #268

I noted in D76603819/#267 that the approach I took did not fix TLS across sync and async endpoints, because the python code for those handlers ran in separate threads.

This diff fixes that by consolidating all of our execution onto a dedicated thread running a Python event loop. We are careful to not run any Python code (except the initial instantiation of the `_Actor` object, which doesn't run any user code) inside this thread, thus preserving thread locality.

As a related change, I moved the code for waiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:
- The control flow is simpler to understand—`handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibilities for scheduling/awaiting background tasks is no longer split between Python and Rust, it's pure Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task would *be silently dropped*, since we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure will get propagated as a supervision event. I added some tests to this effect.
ghstack-source-id: 290891150
@exported-using-ghexport

Differential Revision: [D76661196](https://our.internmc.facebook.com/intern/diff/D76661196/)
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D76661196

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 11f45bb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants