refactor(profiling): add memalloc race regression test #13026

Merged
merged 2 commits into main from nick.ripley/memalloc-race-regression-test on Apr 2, 2025

Conversation

@nsrip-dd (Contributor) commented Apr 2, 2025

Add a regression test for races in the memory allocation profiler. The
test is marked skip for now, for a few reasons:

  • It doesn't trigger the crash in a deterministic amount of time, so
    it isn't really practical for CI or the local dev loop as-is
  • It would probably benefit from having the thread sanitizer enabled,
    which we don't currently do for the memalloc extension

I'm adding the test so that any dd-trace-py developer has an actual
reproducer of the problem that they can easily run, committed somewhere
people can find it. It's currently only really useful for local
development. I plan to tweak/optimize some of the synchronization code
to reduce memalloc overhead, and we need a reliable reproducer of the
crashes the synchronization was meant to fix in order to be confident
we don't reintroduce them.

The test reproduces the crash fixed by #11460, as well as the exception
fixed by #12075. Both issues stem from the same problem: at one point,
memalloc had no synchronization beyond the GIL protecting its internal
state. It turns out that calling back into CPython APIs, as we do when
collecting tracebacks, can in some cases lead to the GIL being released.
So we need additional synchronization for state modification that
straddles CPython API calls. We previously only reliably saw this in a
demo program and weren't able to reproduce it locally. Now that I
understand the crash much better, I was able to create a standalone
reproducer. The key elements are: allocate a lot, trigger GC a lot
(including from memalloc traceback collection), and release the GIL
during GC.
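
For context, here is a minimal sketch of those key elements. It is
illustrative only, not the committed test: the class and function names
are made up, and it assumes the profiler (and therefore memalloc) has
already been started elsewhere, e.g. via ddtrace.profiling.Profiler, so
that allocations are sampled and tracebacks collected.

import gc
import threading
import time


class Thing:
    # A finalizer that sleeps releases the GIL while GC is running,
    # giving other threads a chance to allocate and re-enter memalloc
    # while its internal state is mid-update.
    def __del__(self):
        time.sleep(0)


def churn(stop):
    while not stop.is_set():
        # Allocate a lot, so memalloc samples allocations and collects tracebacks.
        objs = [Thing() for _ in range(1000)]
        del objs
        # Trigger GC a lot, including while traceback collection is in flight.
        gc.collect()


stop = threading.Event()
threads = [threading.Thread(target=churn, args=(stop,)) for _ in range(4)]
for t in threads:
    t.start()
time.sleep(10)  # run long enough to give the race a chance to manifest
stop.set()
for t in threads:
    t.join()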

Important note: this only reliably crashes on Python 3.11. The very
specific path to releasing the GIL that we hit was modified in 3.12 and
later (see python/cpython#97922). We will
probably support 3.11 for a while longer, so it's still worth having
this test.
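
As an illustration of how the skip could be expressed, the sketch below
uses pytest markers; the test name, decorator arguments, and reason
strings are assumptions for illustration, not necessarily what the
committed test uses.

import sys

import pytest


# Skipped by default: the reproducer is non-deterministic and too slow for CI,
# and it only reliably crashes on Python 3.11 (see python/cpython#97922).
@pytest.mark.skip(reason="non-deterministic runtime; run manually for local development")
@pytest.mark.skipif(sys.version_info[:2] != (3, 11), reason="GIL release path only reliably hit on 3.11")
def test_memalloc_race_regression():
    ...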

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@nsrip-dd added the changelog/no-changelog label (A changelog entry is not required for this PR.) Apr 2, 2025

github-actions bot commented Apr 2, 2025

CODEOWNERS have been resolved as:

tests/profiling_v2/collector/test_memalloc.py                           @DataDog/profiling-python


github-actions bot commented Apr 2, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 229 ± 2 ms.

The average import time from base is: 231 ± 1 ms.

The import time difference between this PR and base is: -1.99 ± 0.08 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.983 ms (0.87%)
ddtrace.bootstrap.sitecustomize 1.289 ms (0.56%)
ddtrace.bootstrap.preload 1.289 ms (0.56%)
ddtrace.internal.products 1.289 ms (0.56%)
ddtrace.internal.remoteconfig.client 0.617 ms (0.27%)
ddtrace 0.694 ms (0.30%)
ddtrace._logger 0.026 ms (0.01%)
logging 0.026 ms (0.01%)
traceback 0.026 ms (0.01%)
contextlib 0.026 ms (0.01%)

@nsrip-dd marked this pull request as ready for review April 2, 2025 15:52
@nsrip-dd requested a review from a team as a code owner April 2, 2025 15:52

pr-commenter bot commented Apr 2, 2025

Benchmarks

Benchmark execution time: 2025-04-02 18:45:40

Comparing candidate commit 999577e in PR branch nick.ripley/memalloc-race-regression-test with baseline commit 4ba8976 in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 497 metrics, 2 unstable metrics.

scenario:iast_aspects-ospathbasename_aspect

  • 🟩 execution_time [-331.771ns; -282.407ns] or [-9.262%; -7.884%]

@taegyunkim (Contributor) left a comment

The test passes on this branch and at commit 983c84f, where the fix was introduced to main.
Also, the test segfaults on d855c4a, the parent commit of the fix.

@taegyunkim added the Profiling (Continous Profling) label Apr 2, 2025
In test_memealloc_data_race_regression, also elaborate on what the test
is trying to trigger by sleeping in Thing's __del__ method
@nsrip-dd (Contributor, Author) left a comment

Thanks for taking the time to double-check that the test actually triggers the bug! :) I had been running this exact code, just in a standalone script, with different ddtrace versions, so it's really helpful to verify that it behaves the same in this test setup.

@nsrip-dd enabled auto-merge (squash) April 2, 2025 18:00
@nsrip-dd merged commit 9bed3a9 into main Apr 2, 2025
290 checks passed
@nsrip-dd deleted the nick.ripley/memalloc-race-regression-test branch April 2, 2025 18:52
chojomok pushed a commit that referenced this pull request Apr 7, 2025
Labels
changelog/no-changelog (A changelog entry is not required for this PR.), Profiling (Continous Profling)