refactor(profiling): add memalloc race regression test (#13026)
Add a regression test for races in the memory allocation profiler. The
test is marked skip for now, for a few reasons:
- It doesn't trigger the crash in a deterministic amount of time, so
  it's not really reasonable for CI or the local dev loop as-is
- It probably benefits more from having the thread sanitizer enabled,
  which we don't currently do for the memalloc extension
I'm adding the test so that any dd-trace-py developer has an actual
reproducer of the problem that is easy to run, and so that it is
committed somewhere people can find it.
It's currently only really useful for local development. I plan to
tweak/optimize some of the synchronization code to reduce memalloc
overhead, and we need a reliable reproducer of the crashes the
synchronization was meant to fix in order to be confident we don't
reintroduce them.
The test reproduces the crash fixed by #11460, as well as the exception
fixed by #12075. Both issues stem from the same problem: at one point,
memalloc had no synchronization beyond the GIL protecting its internal
state. It turns out that calling back into CPython C APIs, as we do when
collecting tracebacks, can in some cases lead to the GIL being released.
So we need additional synchronization for state modification that
straddles CPython C API calls. We previously only saw this reliably in a
demo program but weren't able to reproduce it locally. Now that I
understand the crash much better, I was able to create a standalone
reproducer. The key elements are: allocate a lot, trigger GC a lot
(including from memalloc traceback collection), and release the GIL
during GC.
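
As a rough illustration of those three elements (not the committed test
itself), here is a minimal standard-library sketch. The names
`ReleasesGIL`, `churn`, and `run` are hypothetical, and the sketch
assumes the memory profiler has been started separately; how to start it
is not shown in this message, so that step is left out. Without the
profiler running, the code just completes normally.

```python
import gc
import threading
import time


class ReleasesGIL:
    """Hypothetical helper: cyclic garbage whose finalizer yields the GIL."""

    def __init__(self):
        # Self-reference, so only the cyclic garbage collector can reclaim it.
        self.me = self

    def __del__(self):
        # time.sleep releases the GIL while sleeping, so this finalizer
        # gives other threads a chance to run in the middle of a GC pass.
        time.sleep(0)


def churn(iterations):
    """Allocate a lot, creating cyclic garbage on every iteration."""
    for _ in range(iterations):
        ReleasesGIL()    # unreachable immediately, left for the GC
        bytearray(4096)  # extra allocation traffic


def run(num_threads=4, iterations=2000):
    """Run the allocation loop in several threads with an aggressive GC."""
    old_threshold = gc.get_threshold()
    gc.set_threshold(10, 1, 1)  # trigger collections very frequently
    try:
        threads = [
            threading.Thread(target=churn, args=(iterations,))
            for _ in range(num_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        gc.set_threshold(*old_threshold)  # restore the previous GC settings
    return num_threads * iterations


if __name__ == "__main__":
    print("allocation rounds completed:", run())
```

The low GC threshold and the finalizer are what make a GIL release during
collection likely; the thread count and iteration count are arbitrary
values chosen for the sketch.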
Important note: this only reliably crashes on Python 3.11. The very
specific path to releasing the GIL that we hit was modified in 3.12 and
later (see python/cpython#97922). We will probably support 3.11 for a
while longer, so it's still worth having this test.