
[BUG]: DD_PROFILING_MEMORY_ENABLED=true slows down application significantly #13307


Closed
bmalehorn opened this issue Apr 30, 2025 · 8 comments · Fixed by #13317
Labels: bug, Profiling Continous Profling

Comments

@bmalehorn

bmalehorn commented Apr 30, 2025

Tracer Version(s)

ddtrace 3.6.0

Python Version(s)

Python 3.10.6

Pip Version(s)

pip 22.3.1

Bug Report

Hello,

We've noticed a significant slowdown in one of our applications in production that seems to be related to memory size.

This application pulls in quite a lot of data and stores it all in memory. Specifically, it fetches about 1 million rows of a table, in batches of 10k at a time, and stores them as ORM objects in a big list. The slowdown also increases over time: the queries start out fairly fast, but each one gets slower and slower.

I reproduced this in my local environment and ultimately got a 5x speedup by disabling the memory profiler via DD_PROFILING_MEMORY_ENABLED=false - see below for some evidence.

Discussed on the public Datadog Slack here: https://datadoghq.slack.com/archives/C0633E2KH16/p1746038872275079 and in a private message with @taegyunkim.

Org id: 550746 (Lumos)

Reproduction Code

I have a reproduction in my local environment. With memory profiling turned off:

env | grep DD_PROFILING
DD_PROFILING_MEMORY_ENABLED=false
DD_PROFILING_ENABLE_CODE_PROVENANCE=true
DD_PROFILING_ENABLED=true
DD_PROFILING_STACK_V2_ENABLED=true
DD_PROFILING_MAX_FRAMES=256

cd /workflows && ENVIRONMENT=LOCAL /usr/local/bin/pip show ddtrace
Name: ddtrace
Version: 3.6.0

I log how long each batch takes, and each batch runs fairly quickly:

...
elapsed=0.426
elapsed=0.197
elapsed=0.224
...

total time to load 1009015 rows: 108 seconds

When I run with memory profiling enabled:

env | grep DD_PROFILING
DD_PROFILING_MEMORY_ENABLED=true
DD_PROFILING_ENABLE_CODE_PROVENANCE=true
DD_PROFILING_ENABLED=true
DD_PROFILING_STACK_V2_ENABLED=true
DD_PROFILING_MAX_FRAMES=256

cd /workflows && ENVIRONMENT=LOCAL /usr/local/bin/pip show ddtrace
Name: ddtrace
Version: 3.6.0

Running the job, it starts out fast:

21:44:56.047: elapsed=0.256
21:44:56.368: elapsed=0.134
21:44:56.725: elapsed=0.175
...

but after running for several minutes, it's quite slow, and getting slower over time:

21:48:05.580: elapsed=4.788
21:48:14.728: elapsed=5.256
21:48:23.544: elapsed=4.784

total time to load 1009015 rows: 507 seconds

Error Logs

Here are some relevant links for myself & others:

Libraries in Use

Relevant libraries used in this reproduction:

SQLAlchemy==1.3.16
mysqlclient==2.2.7
ddtrace==3.6.0

Operating System

Linux 8ea13d816b46 6.10.14-linuxkit #1 SMP Fri Nov 29 17:22:03 UTC 2024 aarch64 GNU/Linux

@taegyunkim
Contributor

taegyunkim commented Apr 30, 2025

@bmalehorn thanks again for trying different configurations with the latest ddtrace version. It helped us narrow down the root cause. The heap allocation profiler keeps a list of samples it has seen, and it does an O(samples) scan for every free. That is probably why it's fast for the first few batches and much slower for the later ones. We will prioritize fixing this issue.
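For illustration only (this is not the actual memalloc source; the struct and function names are hypothetical), the pattern described above looks roughly like this, with every free paying for a scan over all tracked samples:

```c
/* Illustrative sketch, not the real memalloc code; names are hypothetical. */
#include <stddef.h>

typedef struct {
    void *ptr; /* tracked allocation address; the real profiler also keeps a traceback */
} sample_t;

typedef struct {
    sample_t *samples; /* plain array of tracked samples */
    size_t count;
} heap_tracker_t;

/* Called on every free(): a linear scan over all tracked samples, i.e.
 * O(samples) per free, so frees get slower as the tracked heap grows. */
static void heap_untrack_scan(heap_tracker_t *t, void *ptr)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->samples[i].ptr == ptr) {
            /* remove by swapping in the last element */
            t->samples[i] = t->samples[--t->count];
            return;
        }
    }
}
```

As the number of tracked live allocations grows over the run, each free pays for a longer scan, which matches the per-batch times above growing from ~0.2s to ~5s.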

nsrip-dd added a commit that referenced this issue May 1, 2025
The heap profiler component of memalloc uses a plain array to track
allocations. It performs a linear scan of this array on every free to
see if the freed allocation is tracked. This means the cost of freeing
an allocation increases substantially for programs with large heaps.

Switch to an unordered map for tracking allocations. This should have
roughly constant overhead even for larger heaps, where we track more
allocations.

Rather than find or implement a map in C, use C++. We can wrap a
std::unordered_map in a thin compatibility layer and use it from the
rest of memalloc without having to completely refactor memalloc.

Making this work requires some awkward modifications to setup.py to
build memalloc. Basically, the Extension class can have sources from
multiple languages, but it isn't possible to pass different flags per
language, such as the language standard. Instead, we can use a method
borrowed from a Stack Overflow answer and use the build_clib command,
extended to support compiler arguments. This lets us build a static
library, which we can link from memalloc.

Note that with the way we build the library, it'll be passed to the link
step for other C extensions as well. This should be harmless, though.
The other extensions will only include symbols they actually use, which
in practice will be none. We'll only have trouble if there are
duplicated definitions, but the functions in the map library here are
all "namespaced".

Note also that we might get slightly worse performance due to the extra
cost of map access for very small heaps, compared to using a simple
array.
TODO - demonstrate the difference in performance with this change for
large and small heaps.

Fixes #13307
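As a sketch of what the thin compatibility layer described in this commit could look like from the C side (the names are hypothetical, not the actual memalloc API), a small opaque-handle header is enough; the functions would be implemented in a C++ translation unit on top of std::unordered_map and linked in via the static library mentioned above:

```c
/* Hypothetical C-facing wrapper header, not the actual API from the PR.
 * The implementation would live in C++ on top of std::unordered_map. */
#ifndef MEMALLOC_MAP_H
#define MEMALLOC_MAP_H

#ifdef __cplusplus
extern "C" {
#endif

typedef struct memalloc_map memalloc_map_t; /* opaque handle owned by the C++ side */

memalloc_map_t *memalloc_map_new(void);
void memalloc_map_delete(memalloc_map_t *map);

/* Insert or overwrite the value stored for an allocation address. */
void memalloc_map_insert(memalloc_map_t *map, void *key, void *value);

/* Remove the entry for `key` and return its value, or NULL if not tracked. */
void *memalloc_map_remove(memalloc_map_t *map, void *key);

#ifdef __cplusplus
}
#endif

#endif /* MEMALLOC_MAP_H */
```

A layer like this would let the backing container change (as it later did, to a C hash table) without refactoring the rest of memalloc.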
@bmalehorn
Author

bmalehorn commented May 2, 2025

No problem!

To confirm further and build confidence to re-enable it once fixed, I profiled the C stack frames by running a command like this:

perf record -p <pid> -F 100 -g -- sleep 10

With memory profiling turned on:

[screenshot: perf flame graph with memory profiling enabled]

I see 77% of time was spent in memalloc_heap_untrack. That definitely sounds like the bug you're describing, and the proposed fix #13317 changes this exact function to use a hash map instead. Great!

And with memory profiling turned off:

[screenshot: perf flame graph with memory profiling disabled]

memalloc_heap_untrack does not appear in the stack traces.

Versions used in this test:

Name: ddtrace
Version: 2.11.8

@taegyunkim taegyunkim added the Profiling Continous Profling label May 2, 2025
@nsrip-dd
Contributor

nsrip-dd commented May 2, 2025

Thanks for getting all this data @bmalehorn! Would you be willing to try out the PR with your setup? Here are a few links to get the wheels from CI, depending on which os/architecture you're running on:

The performance looks good for my local setup, but it'd be great to have confirmation from your setup as well.

@bmalehorn
Author

Hey @nsrip-dd, thanks for the links.

I tried testing this out:

RUN poetry install
+RUN poetry run pip install /common/ddtrace-3.7.0.dev47+g5eaa19d9c-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
root@ce1d6dfda1bb:/opt/dagster/dagster_home# cd /workflows && ENVIRONMENT=LOCAL /usr/local/bin/pip show ddtrace
Name: ddtrace
Version: 3.7.0.dev47+g5eaa19d9c
root@ce1d6dfda1bb:/workflows# env | grep DD_PROF
DD_PROFILING_MEMORY_ENABLED=true
DD_PROFILING_ENABLE_CODE_PROVENANCE=true
DD_PROFILING_ENABLED=true
DD_PROFILING_STACK_V2_ENABLED=true
DD_PROFILING_MAX_FRAMES=256

in the repro:

print("{ddtrace._version.__version__=}")
ddtrace._version.__version__='3.7.0.dev47+g5eaa19d9c'

It does run faster, but now profiles do not show any heap live size information:
https://app.datadoghq.com/profiling/profile/AZaShXG9AAAetiaF3f4MBgAA?query=env:brians-macbook-pro.local&agg_m=count&agg_m_source=base&agg_t=count&event=AwAAAZaShW-fBtX98QAAABhBWmFTaFhHOUFBQWV0aWFGM2Y0TUJnQUEAAAAkMDE5NjkyODUtYTJiOS00NGQ3LTk3YjktNDA1NDIwOTJhNTk4AAAA-g&fromUser=true&my_code=disabled&viz=stream&from_ts=1746211324343&to_ts=1746214924343&live=true
[screenshot: profile with no heap live size data]

Is that expected?

I also tried running with Profiler(_memory_collector_enabled=True) to force memory profiling on, but that didn't have any effect either.

I also checked profiles I got from running on 3.6.0 today. They did contain heap live size:

https://app.datadoghq.com/profiling/profile/AZaSfEkjAAAC958wpR-zsAAA?query=env:brians-macbook-pro.local&agg_m=count&agg_m_source=base&agg_t=count&event=AwAAAZaSfEepMoZ4DwAAABhBWmFTZkVrakFBQUM5NTh3cFItenNBQUEAAAAkMDE5NjkyODItNmJkMy00NzM2LThhZGEtMzc1OGVhNDFiZWY0AAA1ag&fromUser=true&my_code=disabled&viz=stream&from_ts=1746211324343&to_ts=1746214924343&live=true
[screenshot: 3.6.0 profile showing heap live size]

@nsrip-dd
Contributor

nsrip-dd commented May 2, 2025

@bmalehorn Thanks a bunch for trying it out! That is definitely not expected. I suspect you might be running into issues linking the C++ library at run time (the memalloc import currently fails silently). To confirm, could you try `import ddtrace.profiling.collector._memalloc`? I'm betting that'll give you some errors. We're working on ironing out the issues with the build; the way I've currently implemented this is a bit unusual, mixing C and C++. I'll try to have something ready early next week. Sorry for the trouble!

Edit: I've tried out using a C hash table instead to get rid of the C++ dependency, so if that is the issue you ran into, and you're up for trying another version, this build should work better: wheels-cp310-manylinux_aarch64.

@bmalehorn
Author

@nsrip-dd Ah yep, it was an import error:

>>> import ddtrace
>>> import ddtrace.profiling.collector._memalloc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/ddtrace/internal/module.py", line 239, in _create_module
    return self.loader.create_module(spec)
ImportError: /usr/local/lib/python3.10/site-packages/ddtrace/profiling/collector/_memalloc.cpython-310-aarch64-linux-gnu.so: undefined symbol: __gxx_personality_v0

I tried installing the new wheel:

ENVIRONMENT=LOCAL /usr/local/bin/poetry run pip install /ddtrace-3.8.0.dev10+gcd8620f72-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
root@ce1d6dfda1bb:/workflows# cd /workflows && ENVIRONMENT=LOCAL /usr/local/bin/poetry run python
Skipping virtualenv creation, as specified in config file.
Python 3.10.6 (main, Aug 23 2022, 09:47:38) [GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ddtrace
>>> import ddtrace._version
>>> ddtrace._version.__version__
'3.8.0.dev10+gcd8620f72'
>>> import ddtrace.profiling.collector._memalloc
>>> 

Looks like the import is OK.

Running the reproduction, now we do see memory profiles:
https://app.datadoghq.com/profiling/profile/AZaSv4wNAABYWlElQ6jibgAA?query=env:brians-macbook-pro.local&agg_m=count&agg_m_source=base&agg_t=count&event=AwAAAZaSv4lt4XCGIAAAABhBWmFTdjR3TkFBQllXbEVsUTZqaWJnQUEAAAAkMDE5NjkyYmYtYTBlYy00ZjIwLWJhNWItY2MwNjQwNzRlNWU1AAAAcw&fromUser=false&my_code=disabled&profile_type=heap-live-size&viz=stream&from_ts=1746215185576&to_ts=1746218785576&live=true

[screenshot: profile showing heap live size with the new wheel]

And it did indeed run faster!

elapsed=0.127
elapsed=0.221
elapsed=0.137
...

total time to load 1009015 rows: 92 seconds

I captured a profile during that time:
[screenshot: perf flame graph with the new wheel]

Only 0.7% of time was spent in memalloc_heap_untrack, down from the 77% I saw earlier. So it definitely seems fixed! The C hash map is more than sufficient to fix this issue, but if you want to go out of your way to use a faster C++ hash map, I'm not going to complain.

> I'll try to have something ready early next week. Sorry for the trouble!

No rush on our end; we're probably not going to upgrade ddtrace immediately. We're mostly just happy to have identified the issue & to have a fix in progress that we'll pick up when we upgrade ddtrace.

@bmalehorn
Author

On a related note about the upgrades, will this bug fix be ported back to ddtrace 2?

@nsrip-dd
Copy link
Contributor

nsrip-dd commented May 5, 2025

> On a related note about the upgrades, will this bug fix be ported back to ddtrace 2?

Yeah, we'll backport this to 2.21

nsrip-dd added a commit that referenced this issue May 7, 2025
…13317)

The heap profiler component of memalloc uses a plain array to track
allocations. It performs a linear scan of this array on every free to
see if the freed allocation is tracked. This means the cost of freeing
an allocation increases substantially for programs with large heaps.

Switch to a hash table for tracking allocations. This should have roughly
constant overhead even for larger heaps, where we track more allocations.

This PR uses the C implementation of the [Abseil
SwissTable](https://abseil.io/blog/20180927-swisstables)
from https://github.com/google/cwisstable. This hash table is designed
for memory efficiency, both in terms of memory used (1 byte per entry of
metadata) and in terms of access (cache-friendly layout). The table
is designed in particular for fast lookup and insertion. We do a lookup
for every single free, so it's important that it's fast.

This PR previously tried to use a C++ `std::unordered_map`, but it
proved far too difficult to integrate with the existing C `memalloc`
implementation.

This PR includes the single `cwisstable.h` header with the following
modifications, which can be found in the header by searching for
`BEGIN MODIFICATION`:

- Fix `CWISS_Mul128`: for 64-bit Windows we need a different intrinsic, and for 32-bit systems we can't use it at all
- Fix a multiple declaration and invalid C++ syntax in `CWISS_LeadingZeroes64` for 32-bit Windows

The PR also supplies a 32-bit compatible hash function since the
one in `cwisstable.h` couldn't be easily made to work. We use `xxHash`
instead. Note that `cwisstable.h` requires the hash function to return
a `size_t`, which is 32 bits on 32-bit platforms, but the SwissTable design
expects 64-bit hashes. I don't know if this is a huge problem in practice
or if people even use our profiler on 32-bit systems, but it's something to
watch out for.

Note that the `cwisstable.h` hash table implementation basically never
gets rid of backing memory unless we completely clear the table. This
is not a big deal in practice. Each entry in the table is an 8 byte `void*` key
and an 8 byte `traceback_t*`, plus 1 byte of control metadata. If we assume
a target load factor of 50%, and we keep the existing `2^16` element cap,
then a completely full table is only about 2MiB. Most of the heap profiler
memory usage comes from the tracebacks themselves, which we _do_
free when they're removed from the profiler structures.

This PR also fixes a potential memory leak. Since we use try-locks, it is theoretically possible to fail to un-track an allocation if we can't acquire the lock. When this happens, the previous implementation would just leak an entry. With the hash table, if we find an existing entry for a given address, we can free the old traceback and replace it with the new one. The real fix will be to change how we handle locking, but this PR improves the situation.

I've tested this locally and see very good performance even for larger heaps,
and this has also been tested in #13307.

Fixes #13307
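A minimal sketch of the idea (this is not the cwisstable-based implementation from the PR; it uses a simple chained map and hypothetical names): keying the map by allocation address makes untracking on free an expected O(1) bucket lookup instead of a scan over every sample, and re-tracking an address that is already present frees the stale traceback rather than leaking it. For scale, at the existing 2^16 sample cap, 17 bytes per entry (8-byte key, 8-byte value, 1 byte of control metadata) at a ~50% load factor works out to roughly 2^17 * 17 B, about 2.2 MiB, consistent with the ~2 MiB figure above.

```c
/* Illustrative sketch only, not the cwisstable SwissTable code from the PR.
 * A minimal chained hash map keyed by allocation address; traceback_t and
 * traceback_free are placeholders for the profiler's real types. */
#include <stdint.h>
#include <stdlib.h>

typedef struct traceback traceback_t;        /* opaque; owned by the profiler */
extern void traceback_free(traceback_t *tb); /* placeholder for the real cleanup */

#define NBUCKETS (1u << 16) /* matches the existing 2^16 sample cap */

typedef struct entry {
    void *ptr;           /* allocation address (the key) */
    traceback_t *tb;     /* traceback captured when the allocation was sampled */
    struct entry *next;
} entry_t;

static entry_t *buckets[NBUCKETS];

static size_t bucket_of(void *p)
{
    /* Fibonacci hashing of the address; the top 16 bits pick one of 2^16 buckets. */
    return (size_t)(((uintptr_t)p * 11400714819323198485ull) >> 48);
}

/* Track a sampled allocation. If an entry for this address already exists
 * (e.g. a free that was missed because a try-lock failed), free the stale
 * traceback and replace it instead of leaking it. */
void heap_track(void *ptr, traceback_t *tb)
{
    size_t b = bucket_of(ptr);
    for (entry_t *e = buckets[b]; e != NULL; e = e->next) {
        if (e->ptr == ptr) {
            traceback_free(e->tb);
            e->tb = tb;
            return;
        }
    }
    entry_t *e = malloc(sizeof(*e));
    if (e == NULL) {
        traceback_free(tb); /* can't track it; drop the sample */
        return;
    }
    e->ptr = ptr;
    e->tb = tb;
    e->next = buckets[b];
    buckets[b] = e;
}

/* Called on every free(): expected O(1), independent of how many live
 * allocations are tracked. Returns the stored traceback, or NULL if this
 * pointer was never tracked. */
traceback_t *heap_untrack(void *ptr)
{
    size_t b = bucket_of(ptr);
    for (entry_t **pe = &buckets[b]; *pe != NULL; pe = &(*pe)->next) {
        if ((*pe)->ptr == ptr) {
            entry_t *e = *pe;
            traceback_t *tb = e->tb;
            *pe = e->next;
            free(e);
            return tb;
        }
    }
    return NULL;
}
```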
nsrip-dd added a commit that referenced this issue May 7, 2025
…backport 2.21] (#13317)

taegyunkim pushed a commit that referenced this issue May 12, 2025
…backport 2.21] (#13317) (#13350)

Backports #13317 to 2.21, including the fix from #13353
