
Commit 60d60a2

nsrip-dd authored and chojomok committed
refactor(profiling): add memalloc race regression test (#13026)
Add a regression test for races in the memory allocation profiler. The test is marked skip for now, for a few reasons:

- It doesn't trigger the crash in a deterministic amount of time, so it's not really reasonable for CI or the local dev loop as-is.
- It probably benefits more from having the thread sanitizer enabled, which we don't currently do for the memalloc extension.

I'm adding the test so that we have an actual reproducer of the problem that any dd-trace-py developer can easily run, committed somewhere people can find it. It's currently only really useful for local development. I plan to tweak/optimize some of the synchronization code to reduce memalloc overhead, and we need a reliable reproducer of the crashes the synchronization was meant to fix in order to be confident we don't reintroduce them.

The test reproduces the crash fixed by #11460, as well as the exception fixed by #12075. Both issues stem from the same problem: at one point, memalloc had no synchronization beyond the GIL protecting its internal state. It turns out that calling back into CPython C APIs, as we do when collecting tracebacks, can in some cases lead to the GIL being released. So we need additional synchronization for state modifications that straddle CPython C API calls.

We previously only saw this reliably in a demo program and weren't able to reproduce it locally. Now that I understand the crash much better, I was able to create a standalone reproducer. The key elements are: allocate a lot, trigger GC a lot (including from memalloc traceback collection), and release the GIL during GC.

Important note: this only reliably crashes on Python 3.11. The very specific path to releasing the GIL that we hit was modified in 3.12 and later (see python/cpython#97922). We will probably support 3.11 for a while longer, so it's still worth having this test.
1 parent b5478b7 commit 60d60a2
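As background (not part of this commit): a minimal sketch of the CPython GC behavior the test in the diff below leans on. A low gc.set_threshold() makes generation-0 collections fire after only a few container-object allocations, a self-referencing object can only be freed by the cycle collector, and time.sleep(0) in its finalizer yields the GIL in the middle of a collection. The threshold of 5 and the loop count here are arbitrary illustration values, not the test's settings.

import gc
import time

# Collect generation 0 after roughly every 5 net container-object allocations
# (the real test uses 100; a smaller value just makes collections frequent here).
gc.set_threshold(5)

# Log each collection so the effect of the low threshold is visible.
gc.callbacks.append(lambda phase, info: print(phase, "gen", info["generation"]))


class Yielder:
    def __init__(self):
        # Reference cycle: refcounting alone can't free this object, only the
        # cycle collector can, so __del__ runs *during* a GC pass.
        self.ref = self

    def __del__(self):
        # sleep(0) releases the GIL, letting another thread run while the
        # collection that is finalizing this object is still in progress.
        time.sleep(0)


for _ in range(25):
    Yielder()  # each allocation nudges the gen-0 counter toward the threshold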

File tree

1 file changed: +72 lines, -0 lines

tests/profiling_v2/collector/test_memalloc.py

Lines changed: 72 additions & 0 deletions
@@ -301,3 +301,75 @@ def test_heap_profiler_sampling_accuracy(sample_interval):
     # probably too generous for low sampling intervals but at least won't be
     # flaky.
     assert abs(rel - actual_rel) < 0.10
+
+
+@pytest.mark.skip(reason="too slow, indeterministic")
+@pytest.mark.subprocess(
+    env=dict(
+        # Turn off other profilers so that we're just testing memalloc
+        DD_PROFILING_STACK_ENABLED="false",
+        DD_PROFILING_LOCK_ENABLED="false",
+        # Upload a lot, since rotating out memalloc profiler state can race with profiling
+        DD_PROFILING_UPLOAD_INTERVAL="1",
+    ),
+)
+def test_memealloc_data_race_regression():
+    import gc
+    import threading
+    import time
+
+    from ddtrace.profiling import Profiler
+
+    gc.enable()
+    # This threshold controls when garbage collection is triggered. The
+    # threshold is on the count of live allocations, which is checked when doing
+    # a new allocation. This test is ultimately trying to get the allocation of
+    # frame objects during the memory profiler's traceback function to trigger
+    # garbage collection. We want a lower threshold to improve the odds that
+    # this happens.
+    gc.set_threshold(100)
+
+    class Thing:
+        def __init__(self):
+            # Self reference so this gets deallocated in GC vs via refcount
+            self.ref = self
+
+        def __del__(self):
+            # Force GIL yield, so if/when memalloc triggers GC, this is
+            # deallocated, releasing GIL while memalloc is sampling and allowing
+            # something else to run and possibly modify memalloc's internal
+            # state concurrently
+            time.sleep(0)
+
+    def do_alloc():
+        def f():
+            return Thing()
+
+        return f
+
+    def lotsa_allocs(ev):
+        while not ev.is_set():
+            f = do_alloc()
+            f()
+            time.sleep(0.01)
+
+    p = Profiler()
+    p.start()
+
+    threads = []
+    ev = threading.Event()
+    for i in range(4):
+        t = threading.Thread(target=lotsa_allocs, args=(ev,))
+        t.start()
+        threads.append(t)
+
+    # Arbitrary sleep. This typically crashes in about a minute.
+    # But for local development, either let it run way longer or
+    # figure out sanitizer instrumentation
+    time.sleep(120)
+
+    p.stop()
+
+    ev.set()
+    for t in threads:
+        t.join()
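Since the test is marked skip, it will not run in CI or under a plain pytest invocation. Below is a hedged sketch of how one might exercise the same workload as a standalone script during local development. The environment variable names come from the test above; whether setting them from Python before the ddtrace import takes effect depends on ddtrace's configuration loading, so exporting them in the shell before launching Python is the safer equivalent. The 600-second runtime is an arbitrary choice, not something from the commit.

# Sketch only: mirrors the skipped test's workload, outside pytest.
# Assumes a dd-trace-py checkout installed into the current environment
# (e.g. `pip install -e .`) and Python 3.11, where the crash reproduces
# most reliably per the commit message.
import os

# Same environment the @pytest.mark.subprocess marker sets in the test above.
os.environ["DD_PROFILING_STACK_ENABLED"] = "false"
os.environ["DD_PROFILING_LOCK_ENABLED"] = "false"
os.environ["DD_PROFILING_UPLOAD_INTERVAL"] = "1"

import gc
import threading
import time

from ddtrace.profiling import Profiler


class Thing:
    def __init__(self):
        self.ref = self  # cycle: freed by the GC, not by refcounting

    def __del__(self):
        time.sleep(0)  # yield the GIL while a collection is in progress


def do_alloc():
    def f():
        return Thing()

    return f


def lotsa_allocs(ev):
    while not ev.is_set():
        f = do_alloc()
        f()
        time.sleep(0.01)


if __name__ == "__main__":
    gc.enable()
    gc.set_threshold(100)

    p = Profiler()
    p.start()

    ev = threading.Event()
    threads = [threading.Thread(target=lotsa_allocs, args=(ev,)) for _ in range(4)]
    for t in threads:
        t.start()

    # Run far longer than the test's 120s if the crash does not show up quickly.
    time.sleep(600)

    p.stop()
    ev.set()
    for t in threads:
        t.join()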
