refactor(profiling): add memalloc race regression test #13026

Merged
merged 2 commits into main from nick.ripley/memalloc-race-regression-test on Apr 2, 2025

Conversation

@nsrip-dd (Contributor) commented Apr 2, 2025

Add a regression test for races in the memory allocation profiler. The
test is marked skip for now, for a few reasons:

  • It doesn't trigger the crash in a deterministic amount of time, so
    it isn't really practical for CI or the local dev loop as-is
  • It would probably benefit from having the thread sanitizer enabled,
    which we don't currently do for the memalloc extension

I'm adding the test so that any dd-trace-py developer has an actual
reproducer of the problem that they can easily run, committed somewhere
people can find it. It's currently only really useful for local
development. I plan to tweak/optimize some of the synchronization code
to reduce memalloc overhead, and we need a reliable reproducer of the
crashes the synchronization was meant to fix in order to be confident
we don't reintroduce them.

The test reproduces the crash fixed by #11460, as well as the exception
fixed by #12075. Both issues stem from the same problem: at one point,
memalloc had no synchronization beyond the GIL protecting its internal
state. It turns out that calling back into CPython APIs, as we do when
collecting tracebacks, can in some cases lead to the GIL being released.
So we need additional synchronization for state modification that
straddles CPython API calls. We previously only reliably saw this in a
demo program and weren't able to reproduce it locally. Now that I
understand the crash much better, I was able to create a standalone
reproducer. The key elements are: allocate a lot, trigger GC a lot
(including from memalloc traceback collection), and release the GIL
during GC.
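
For context, here is a minimal sketch of those key elements. It is
illustrative only, not the committed test: the class and function names
are made up, and it assumes the profiler (and therefore memalloc) has
already been started elsewhere, e.g. via ddtrace.profiling.Profiler, so
that allocations are sampled and tracebacks collected.

import gc
import threading
import time


class Thing:
    # A finalizer that sleeps releases the GIL while GC is running,
    # giving other threads a chance to allocate and re-enter memalloc
    # while its internal state is mid-update.
    def __del__(self):
        time.sleep(0)


def churn(stop):
    while not stop.is_set():
        # Allocate a lot, so memalloc samples allocations and collects tracebacks.
        objs = [Thing() for _ in range(1000)]
        del objs
        # Trigger GC a lot, including while traceback collection is in flight.
        gc.collect()


stop = threading.Event()
threads = [threading.Thread(target=churn, args=(stop,)) for _ in range(4)]
for t in threads:
    t.start()
time.sleep(10)  # run long enough to give the race a chance to manifest
stop.set()
for t in threads:
    t.join()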

Important note: this only reliably crashes on Python 3.11. The very
specific path to releasing the GIL that we hit was modified in 3.12 and
later (see python/cpython#97922). We will
probably support 3.11 for a while longer, so it's still worth having
this test.
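
As an illustration of how the skip could be expressed, the sketch below
uses pytest markers; the test name, decorator arguments, and reason
strings are assumptions for illustration, not necessarily what the
committed test uses.

import sys

import pytest


# Skipped by default: the reproducer is non-deterministic and too slow for CI,
# and it only reliably crashes on Python 3.11 (see python/cpython#97922).
@pytest.mark.skip(reason="non-deterministic runtime; run manually for local development")
@pytest.mark.skipif(sys.version_info[:2] != (3, 11), reason="GIL release path only reliably hit on 3.11")
def test_memalloc_race_regression():
    ...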

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@nsrip-dd added the changelog/no-changelog label (A changelog entry is not required for this PR.) Apr 2, 2025

github-actions bot commented Apr 2, 2025

CODEOWNERS have been resolved as:

tests/profiling_v2/collector/test_memalloc.py                           @DataDog/profiling-python


github-actions bot commented Apr 2, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 229 ± 2 ms.

The average import time from base is: 231 ± 1 ms.

The import time difference between this PR and base is: -1.99 ± 0.08 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.983 ms (0.87%)
ddtrace.bootstrap.sitecustomize 1.289 ms (0.56%)
ddtrace.bootstrap.preload 1.289 ms (0.56%)
ddtrace.internal.products 1.289 ms (0.56%)
ddtrace.internal.remoteconfig.client 0.617 ms (0.27%)
ddtrace 0.694 ms (0.30%)
ddtrace._logger 0.026 ms (0.01%)
logging 0.026 ms (0.01%)
traceback 0.026 ms (0.01%)
contextlib 0.026 ms (0.01%)

@nsrip-dd marked this pull request as ready for review April 2, 2025 15:52
@nsrip-dd requested a review from a team as a code owner April 2, 2025 15:52

pr-commenter bot commented Apr 2, 2025

Benchmarks

Benchmark execution time: 2025-04-02 18:45:40

Comparing candidate commit 999577e in PR branch nick.ripley/memalloc-race-regression-test with baseline commit 4ba8976 in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathbasename_aspect

  • 🟩 execution_time [-331.771ns; -282.407ns] or [-9.262%; -7.884%]

@taegyunkim (Contributor) left a comment

The test passes on this branch and at commit 983c84f, where the fix was introduced to main.
Also, the test segfaults on d855c4a, the parent commit of the fix.

@taegyunkim added the Profiling (Continous Profling) label Apr 2, 2025
In test_memealloc_data_race_regression, also elaborate on what the test
is trying to trigger by sleeping in Thing's __del__ method
@nsrip-dd (Contributor, Author) left a comment

Thanks for taking the time to double-check that the test actually triggers the bug! :) I had been running this exact code, just in a standalone script, with different ddtrace versions, so it's really helpful to verify that it behaves the same in this test setup.

@nsrip-dd enabled auto-merge (squash) April 2, 2025 18:00
@nsrip-dd merged commit 9bed3a9 into main Apr 2, 2025
290 checks passed
@nsrip-dd deleted the nick.ripley/memalloc-race-regression-test branch April 2, 2025 18:52
chojomok pushed a commit that referenced this pull request Apr 7, 2025
Labels
changelog/no-changelog (A changelog entry is not required for this PR.), Profiling (Continous Profling)