refactor(profiling): add memalloc race regression test #13026
Conversation
Add a regression test for races in the memory allocation profiler. The test is marked skip for now, for a few reasons:

- It doesn't trigger the crash in a deterministic amount of time, so it's not really reasonable for CI or the local dev loop as-is
- It probably benefits more from having the thread sanitizer enabled, which we don't currently do for the memalloc extension

I'm adding the test so that we have an actual reproducer of the problem, easily runnable by any dd-trace-py developer, committed somewhere people can find it. It's currently only really useful for local development. I plan to tweak/optimize some of the synchronization code to reduce memalloc overhead, and we need a reliable reproducer of the crashes the synchronization was meant to fix in order to be confident we don't reintroduce them.

The test reproduces the crash fixed by #11460, as well as the exception fixed by #12075. Both issues stem from the same problem: at one point, memalloc had no synchronization beyond the GIL protecting its internal state. It turns out that calling back into C Python APIs, as we do when collecting tracebacks, can in some cases lead to the GIL being released. So we need additional synchronization for state modifications that straddle C Python API calls. We previously only saw this reliably in a demo program and weren't able to reproduce it locally. Now that I understand the crash much better, I was able to create a standalone reproducer. The key elements are: allocate a lot, trigger GC a lot (including from memalloc traceback collection), and release the GIL during GC.

Important note: this only reliably crashes on Python 3.11. The very specific path to releasing the GIL that we hit was modified in 3.12 and later (see python/cpython#97922). We will probably support 3.11 for a while longer, so it's still worth having this test.
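For reference, the shape of such a reproducer can be sketched as follows. This is a hypothetical illustration of the key elements named above (allocate a lot, trigger GC a lot, release the GIL during GC via a sleeping finalizer), not the actual test code; the names `Thing`, `churn`, and the thread count and duration are illustrative:

```python
import gc
import threading
import time

NTHREADS = 4
DEADLINE = time.monotonic() + 0.5  # the real test would loop much longer

class Thing:
    def __del__(self):
        # time.sleep releases the GIL, so finalizing this object during a
        # GC pass gives other threads a chance to race with memalloc's
        # traceback collection
        time.sleep(0.001)

def churn():
    while time.monotonic() < DEADLINE:
        # create a reference cycle so the objects are freed by the cyclic
        # GC rather than by reference counting
        a, b = Thing(), Thing()
        a.other, b.other = b, a
        del a, b
        gc.collect()

threads = [threading.Thread(target=churn) for _ in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the memalloc profiler enabled, several threads allocating, collecting, and sleeping inside `__del__` like this is the kind of workload that interleaves GC-driven finalization with traceback collection.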
Bootstrap import analysis

Comparison of import times between this PR and base.

Summary: The average import time from this PR is 229 ± 2 ms. The average import time from base is 231 ± 1 ms. The import time difference between this PR and base is -1.99 ± 0.08 ms.

Import time breakdown: the following import paths have shrunk:
Benchmarks

Benchmark execution time: 2025-04-02 18:45:40. Comparing candidate commit 999577e in PR branch. Found 1 performance improvement and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics.

scenario: iast_aspects-ospathbasename_aspect
In test_memealloc_data_race_regression: also elaborate on what the test is trying to trigger by sleeping in Thing's __del__ method.
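The point of the sleeping finalizer mentioned above can be illustrated with a minimal sketch (this `Thing` is a hypothetical stand-in for the class in the test, and the sleep duration is an assumption):

```python
import gc
import time

collected = []

class Thing:
    def __del__(self):
        # Record that the finalizer ran, then sleep. time.sleep releases
        # the GIL, so a finalizer that sleeps forces the GIL to be dropped
        # in the middle of a garbage-collection pass, widening the window
        # in which another thread can race with memalloc's bookkeeping.
        collected.append(True)
        time.sleep(0.001)

# Build a reference cycle so the object is freed by the cyclic GC
# (which runs __del__ during collection) rather than by refcounting.
a = Thing()
a.self_ref = a
del a
gc.collect()  # runs Thing.__del__, releasing the GIL mid-collection
```

Since PEP 442 (Python 3.4), objects with `__del__` that sit in reference cycles are still collected, so the finalizer reliably runs inside `gc.collect()`.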
Thanks for taking the time to double-check that the test actually triggers the bug! :) I was running this exact code in a standalone script with different ddtrace versions, so it's really helpful to verify that it works the same in this test setup.