[monarch] Fix TLS behavior for async endpoints too #268

suo · 2025-06-14T15:49:06Z

Stack from ghstack (oldest at bottom):

-> [monarch] Fix TLS behavior for async endpoints too #268

I noted in D76603819/#267 that the approach I took there did not fix thread-local storage (TLS) across sync and async endpoints, because the Python code for those handlers ran on separate threads.

This diff implements the proposed fix:

- We should have two flavors of execution. If the actor has any async endpoint, we create an `_AsyncActor`; otherwise we create a `_SyncActor`.
- For an `_AsyncActor`: we schedule all Python code on the asyncio event loop. This removes the current behavior where some code runs on the handler thread and some on the event loop. Every `_AsyncActor` gets its own thread to run code in.
- For a `_SyncActor`: we schedule all Python code inline on the handler thread. Each `_SyncActor` *also* gets its own thread to run code in.
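
To make the two flavors concrete, here is a rough Python-only sketch of the selection rule and the one-thread-per-actor property. It is not the code in this diff: `has_async_endpoint`, `SyncRunner`, `AsyncRunner`, `make_runner`, and the example classes are hypothetical names invented for illustration, and the real scheduling goes through the Rust handler rather than a `ThreadPoolExecutor`.

```python
import asyncio
import inspect
import threading
from concurrent.futures import ThreadPoolExecutor


def has_async_endpoint(cls) -> bool:
    """Return True if any public method of cls is declared with `async def`."""
    return any(
        inspect.iscoroutinefunction(fn)
        for name, fn in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    )


class SyncRunner:
    """Hypothetical stand-in for the sync flavor: calls run inline on one thread."""

    def __init__(self, instance):
        self.instance = instance
        # A single worker guarantees every call lands on the same thread.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def call(self, name, *args):
        return self._executor.submit(getattr(self.instance, name), *args).result()

    def close(self):
        self._executor.shutdown()


class AsyncRunner:
    """Hypothetical stand-in for the async flavor: all calls run on one event loop."""

    def __init__(self, instance):
        self.instance = instance
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, name, *args):
        method = getattr(self.instance, name)

        async def invoke():
            # Sync endpoints run directly on the loop thread; async ones are awaited there.
            result = method(*args)
            return await result if inspect.isawaitable(result) else result

        return asyncio.run_coroutine_threadsafe(invoke(), self._loop).result()

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


def make_runner(instance):
    """Pick the async flavor if the class has any async endpoint, else the sync one."""
    runner_cls = AsyncRunner if has_async_endpoint(type(instance)) else SyncRunner
    return runner_cls(instance)


class Mixed:
    """Example actor-like class with one sync and one async endpoint sharing TLS."""

    def __init__(self):
        self.tls = threading.local()

    def set_value(self, value):
        self.tls.value = value

    async def get_value(self):
        return self.tls.value


class OnlySync:
    """Example actor-like class with only sync endpoints."""

    def thread_id(self):
        return threading.get_ident()
```

With this shape, `make_runner(Mixed())` picks the async flavor, and a value stored by the sync `set_value` is visible to the async `get_value` because both run on the same thread.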

As a related change, I moved the code for awaiting the async endpoints as background tasks from Python to Rust. This has a number of advantages:

- The control flow is simpler to understand: `handle` and `handle_def` are regular async functions, not sync functions that return coroutines. Responsibility for scheduling and awaiting background tasks is no longer split between Python and Rust; it lives entirely in Rust.
- The error behavior is more correct. Previously, an uncaught error in a background task was *silently dropped*, because we never directly awaited the `_complete()` task to get exceptions out. Now these tasks are modeled as hyperactor actors, and any uncaught failure is propagated as a supervision event. I added some tests to this effect.
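
The error-handling point can be illustrated with a small asyncio analogue. This is only a sketch: the diff itself does this in Rust with hyperactor actors and supervision events, while `spawn_supervised`, `on_failure`, and the example endpoints below are hypothetical names. The idea is that a background task's exception is always retrieved and forwarded instead of being left on a task nobody awaits.

```python
import asyncio


def spawn_supervised(coro, on_failure):
    """Start coro as a background task and forward any uncaught exception."""
    task = asyncio.create_task(coro)

    def report(done):
        if done.cancelled():
            return
        exc = done.exception()  # retrieving it here means it is never dropped
        if exc is not None:
            on_failure(exc)

    task.add_done_callback(report)
    return task


async def ok_endpoint():
    return "ok"


async def failing_endpoint():
    raise ValueError("endpoint failed")


async def demo():
    """Run one healthy and one failing endpoint; return the reported failures."""
    events = []
    tasks = [
        spawn_supervised(ok_endpoint(), events.append),
        spawn_supervised(failing_endpoint(), events.append),
    ]
    await asyncio.wait(tasks)
    await asyncio.sleep(0)  # give any pending done-callbacks a chance to run
    return events
```

Here `events.append` plays the role of the supervisor: calling `asyncio.run(demo())` yields a list holding the single `ValueError` raised by the failing endpoint.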

Differential Revision: D76661196

ghstack-source-id: 290472305
Pull Request resolved: #268

facebook-github-bot · 2025-06-14T15:49:28Z