
Commit bb24ee3

perf(profiling): improve scaling of memory profiler for large heaps (#13317)
The heap profiler component of memalloc uses a plain array to track allocations, and it performs a linear scan of this array on every free to see if the freed allocation is tracked. This means the cost of freeing an allocation increases substantially for programs with large heaps.

Switch to a hash table for tracking allocations. This should have roughly constant overhead even for larger heaps, where we track more allocations. This PR uses the C implementation of the [Abseil SwissTable](https://abseil.io/blog/20180927-swisstables) from https://github.com/google/cwisstable. This hash table is designed for memory efficiency, both in terms of memory used (1 byte of metadata per entry) and in terms of access (cache-friendly layout). The table is designed in particular for fast lookup and insertion. We do a lookup for every single free, so it's important that it's fast. This PR previously tried to use a C++ `std::unordered_map`, but it proved far too difficult to integrate with the existing C `memalloc` implementation.

This PR includes the single `cwisstable.h` header with the following modifications, which can be found in the header by searching for `BEGIN MODIFICATION`:

- Fix `CWISS_Mul128`: for 64-bit Windows we need a different intrinsic, and for 32-bit systems we can't use it at all
- Fix a multiple declaration and invalid C++ syntax in `CWISS_LeadingZeroes64` for 32-bit Windows

The PR also supplies a 32-bit compatible hash function, since the one in `cwisstable.h` couldn't easily be made to work; we use `xxHash` instead. Note that `cwisstable.h` requires the hash function to return a `size_t`, which is 32 bits on 32-bit platforms, while the SwissTable design expects 64-bit hashes. I don't know if this is a huge problem in practice, or whether people even use our profiler on 32-bit systems, but it's something to watch out for.

Note that the `cwisstable.h` hash table implementation basically never gets rid of backing memory unless we completely clear the table. This is not a big deal in practice. Each entry in the table is an 8-byte `void*` key and an 8-byte `traceback_t*` value, plus 1 byte of control metadata. If we assume a target load factor of 50%, and we keep the existing `2^16` element cap, then a completely full table is only about 2MiB. Most of the heap profiler memory usage comes from the tracebacks themselves, which we _do_ free when they're removed from the profiler structures.

This PR also fixes a potential memory leak. Since we use try-locks, it is theoretically possible to fail to un-track an allocation if we can't acquire the lock. When this happens, the previous implementation would just leak an entry. With the hash table, if we find an existing entry for a given address, we can free the old traceback and replace it with the new one. The real fix will be to fix how we handle locking, but this PR improves the situation.

I've tested this locally and see very good performance even for larger heaps, and this has also been tested in #13307.

Fixes #13307
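The ~2MiB worst-case figure above can be sanity-checked with a tiny standalone sketch (illustrative only, not part of this commit); it just redoes the 17-bytes-per-entry arithmetic under the stated ~50% load factor and the existing `2^16` element cap:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Per-entry cost in the SwissTable: 8-byte void* key, 8-byte traceback_t*
     * value, plus 1 byte of control metadata. */
    const uint64_t bytes_per_entry = 8 + 8 + 1;
    /* Existing cap on tracked allocations, kept by this PR. */
    const uint64_t max_entries = 1ULL << 16;
    /* Target load factor of ~50% means roughly twice as many slots as entries. */
    const uint64_t slots = 2 * max_entries;

    double mib = (double)(slots * bytes_per_entry) / (1024.0 * 1024.0);
    printf("worst-case table size ~= %.2f MiB\n", mib); /* prints ~2.13 MiB */
    return 0;
}
```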
1 parent 57cf21f commit bb24ee3

6 files changed: +3552 −61 lines changed


ddtrace/profiling/collector/_memalloc_heap.c

Lines changed: 40 additions & 60 deletions
```diff
@@ -3,6 +3,7 @@
 
 #define PY_SSIZE_T_CLEAN
 #include "_memalloc_heap.h"
+#include "_memalloc_heap_map.h"
 #include "_memalloc_reentrant.h"
 #include "_memalloc_tb.h"
 
@@ -66,15 +67,15 @@ typedef struct
     /* Next heap sample target, in bytes allocated */
     uint64_t current_sample_size;
     /* Tracked allocations */
-    traceback_array_t allocs;
+    memalloc_heap_map_t* allocs_m;
     /* Bytes allocated since the last sample was collected */
     uint64_t allocated_memory;
     /* True if the heap tracker is frozen */
     bool frozen;
     /* Contains the ongoing heap allocation/deallocation while frozen */
     struct
     {
-        traceback_array_t allocs;
+        memalloc_heap_map_t* allocs_m;
         ptr_array_t frees;
     } freezer;
 } heap_tracker_t;
@@ -146,8 +147,8 @@ heap_tracker_next_sample_size(uint32_t sample_size)
 static void
 heap_tracker_init(heap_tracker_t* heap_tracker)
 {
-    traceback_array_init(&heap_tracker->allocs);
-    traceback_array_init(&heap_tracker->freezer.allocs);
+    heap_tracker->allocs_m = memalloc_heap_map_new();
+    heap_tracker->freezer.allocs_m = memalloc_heap_map_new();
     ptr_array_init(&heap_tracker->freezer.frees);
     heap_tracker->allocated_memory = 0;
     heap_tracker->frozen = false;
@@ -158,8 +159,8 @@ heap_tracker_init(heap_tracker_t* heap_tracker)
 static void
 heap_tracker_wipe(heap_tracker_t* heap_tracker)
 {
-    traceback_array_wipe(&heap_tracker->allocs);
-    traceback_array_wipe(&heap_tracker->freezer.allocs);
+    memalloc_heap_map_delete(heap_tracker->allocs_m);
+    memalloc_heap_map_delete(heap_tracker->freezer.allocs_m);
     ptr_array_wipe(&heap_tracker->freezer.frees);
 }
 
@@ -172,48 +173,26 @@ heap_tracker_freeze(heap_tracker_t* heap_tracker)
 static void
 heap_tracker_untrack_thawed(heap_tracker_t* heap_tracker, void* ptr)
 {
-    /* This search is O(n) where `n` is the number of tracked traceback,
-       which is linearly linked to the heap size. This search could probably be
-       optimized in a couple of ways:
-
-       - sort the traceback in allocs by ptr so we can find the ptr in O(log2 n)
-       - use a Bloom filter?
-
-       That being said, we start iterating at the end of the array because most
-       of the time this is where the untracked ptr is (the most recent object
-       get de-allocated first usually). This might be a good enough
-       trade-off. */
-    for (TRACEBACK_ARRAY_COUNT_TYPE i = heap_tracker->allocs.count; i > 0; i--) {
-        traceback_t** tb = &heap_tracker->allocs.tab[i - 1];
-
-        if (ptr == (*tb)->ptr) {
-            /* Free the traceback */
-            traceback_free(*tb);
-            traceback_array_remove(&heap_tracker->allocs, tb);
-            break;
-        }
+    traceback_t* tb = memalloc_heap_map_remove(heap_tracker->allocs_m, ptr);
+    if (tb) {
+        traceback_free(tb);
     }
 }
 
 static void
 heap_tracker_thaw(heap_tracker_t* heap_tracker)
 {
-    /* Add the frozen allocs at the end */
-    traceback_array_splice(&heap_tracker->allocs,
-                           heap_tracker->allocs.count,
-                           0,
-                           heap_tracker->freezer.allocs.tab,
-                           heap_tracker->freezer.allocs.count);
+    memalloc_heap_map_destructive_copy(heap_tracker->allocs_m, heap_tracker->freezer.allocs_m);
 
     /* Handle the frees: we need to handle the frees after we merge the allocs
       array together to be sure that there's no free in the freezer matching
       an alloc that is also in the freezer; heap_tracker_untrack_thawed does
       not care about the freezer, by definition. */
-    for (MEMALLOC_HEAP_PTR_ARRAY_COUNT_TYPE i = 0; i < heap_tracker->freezer.frees.count; i++)
+    for (MEMALLOC_HEAP_PTR_ARRAY_COUNT_TYPE i = 0; i < heap_tracker->freezer.frees.count; i++) {
         heap_tracker_untrack_thawed(heap_tracker, heap_tracker->freezer.frees.tab[i]);
+    }
 
     /* Reset the count to zero so we can reused the array and overwrite previous values */
-    heap_tracker->freezer.allocs.count = 0;
     heap_tracker->freezer.frees.count = 0;
 
     heap_tracker->frozen = false;
@@ -249,16 +228,21 @@ memalloc_heap_untrack(void* ptr)
         return;
     }
     if (global_heap_tracker.frozen) {
-        /* Check that we still have space to store the free. If we don't have
-           enough space, we ignore the untrack. That's sad as there is a change
-           the heap profile won't be valid anymore. However, that's the best we
-           can do since reporting an error is not an option here. What's gonna
-           free more than 2^64 pointers anyway?!
-        */
-        if (global_heap_tracker.freezer.frees.count < MEMALLOC_HEAP_PTR_ARRAY_MAX_COUNT)
+        /* NB: because we also take the lock in memalloc_heap, we're not going to
+         * take this branch. We will, however, if we start using the GIL as our
+         * lock instead.
+         */
+        traceback_t* tb = memalloc_heap_map_remove(global_heap_tracker.freezer.allocs_m, ptr);
+        if (tb) {
+            traceback_free(tb);
+        } else if (memalloc_heap_map_contains(global_heap_tracker.allocs_m, ptr)) {
+            /* We're tracking this pointer but can't remove it right now because
+             * we're iterating over the map. Save the pointer to remove later */
             ptr_array_append(&global_heap_tracker.freezer.frees, ptr);
-        } else
+        }
+    } else {
         heap_tracker_untrack_thawed(&global_heap_tracker, ptr);
+    }
 
     memlock_unlock(&g_memheap_lock);
 }
@@ -289,10 +273,14 @@ memalloc_heap_track(uint16_t max_nframe, void* ptr, size_t size, PyMemAllocatorD
         return false;
     }
 
-    /* Check if we can add more samples: the sum of the freezer + alloc tracker
-       cannot be greater than what the alloc tracker can handle: when the alloc
-       tracker is thawed, all the allocs in the freezer will be moved there!*/
-    if (global_heap_tracker.freezer.allocs.count + global_heap_tracker.allocs.count >= TRACEBACK_ARRAY_MAX_COUNT) {
+    if (memalloc_heap_map_size(global_heap_tracker.allocs_m) +
+          memalloc_heap_map_size(global_heap_tracker.freezer.allocs_m) >
+        TRACEBACK_ARRAY_MAX_COUNT) {
+        /* TODO(nick) this is vestigial from the original array-based
+         * implementation. Do we actually want this? It gives us bounded memory
+         * use, but the size limit is arbitrary and once we hit the arbitrary
+         * limit our reported numbers will be inaccurate.
+         */
         memlock_unlock(&g_memheap_lock);
         return false;
     }
@@ -310,10 +298,11 @@ memalloc_heap_track(uint16_t max_nframe, void* ptr, size_t size, PyMemAllocatorD
       of sample live allocations stays close to the actual heap size */
     traceback_t* tb = memalloc_get_traceback(max_nframe, ptr, global_heap_tracker.allocated_memory, domain);
     if (tb) {
-        if (global_heap_tracker.frozen)
-            traceback_array_append(&global_heap_tracker.freezer.allocs, tb);
-        else
-            traceback_array_append(&global_heap_tracker.allocs, tb);
+        if (global_heap_tracker.frozen) {
+            memalloc_heap_map_insert(global_heap_tracker.freezer.allocs_m, tb->ptr, tb);
+        } else {
+            memalloc_heap_map_insert(global_heap_tracker.allocs_m, tb->ptr, tb);
+        }
 
         /* Reset the counter to 0 */
         global_heap_tracker.allocated_memory = 0;
@@ -340,16 +329,7 @@ memalloc_heap()
 
     heap_tracker_freeze(&global_heap_tracker);
 
-    PyObject* heap_list = PyList_New(global_heap_tracker.allocs.count);
-
-    for (TRACEBACK_ARRAY_COUNT_TYPE i = 0; i < global_heap_tracker.allocs.count; i++) {
-        traceback_t* tb = global_heap_tracker.allocs.tab[i];
-
-        PyObject* tb_and_size = PyTuple_New(2);
-        PyTuple_SET_ITEM(tb_and_size, 0, traceback_to_tuple(tb));
-        PyTuple_SET_ITEM(tb_and_size, 1, PyLong_FromSize_t(tb->size));
-        PyList_SET_ITEM(heap_list, i, tb_and_size);
-    }
+    PyObject* heap_list = memalloc_heap_map_export(global_heap_tracker.allocs_m);
 
     heap_tracker_thaw(&global_heap_tracker);
 
```
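For context before the new map implementation below: a toy, self-contained sketch (not part of the commit) of the per-free linear scan that the removed code above performed. With n tracked allocations, freeing a long-lived allocation probes all n entries, which is the scaling problem this change removes in favor of a single expected-O(1) hash lookup.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for the old tracked-allocations array. */
typedef struct
{
    void** tab;
    size_t count;
} toy_ptr_array_t;

/* Mirror of the removed untrack path: walk the array from the end looking for
 * a matching pointer, and return how many entries were inspected. */
static size_t
toy_untrack_scan_cost(toy_ptr_array_t* allocs, void* ptr)
{
    size_t probes = 0;
    for (size_t i = allocs->count; i > 0; i--) {
        probes++;
        if (allocs->tab[i - 1] == ptr) {
            break;
        }
    }
    return probes;
}

int main(void)
{
    void* slots[1024];
    toy_ptr_array_t allocs = { slots, 1024 };
    for (size_t i = 0; i < allocs.count; i++) {
        slots[i] = &slots[i];
    }
    /* Freeing the oldest tracked allocation probes the whole array. */
    printf("probes for oldest entry: %zu\n", toy_untrack_scan_cost(&allocs, &slots[0]));
    return 0;
}
```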
Lines changed: 180 additions & 0 deletions
```diff
@@ -0,0 +1,180 @@
+#include <stdlib.h>
+
+#include <Python.h>
+
+#include "_memalloc_tb.h"
+#include "vendor/cwisstable.h"
+
+/* cwisstable.h provides a C implementation of SwissTables hash maps, originally
+ * implemented in the Abseil C++ library.
+ *
+ * This header was generated from https://github.com/google/cwisstable
+ * at commit 6de0e5f2e55f90017534a3366198ce7d3e3b7fef
+ * and lightly modified to compile for Windows and 32-bit platforms we support.
+ * See "BEGIN MODIFICATION" and "END MODIFICATION" in the header.
+ *
+ * The following macro will expand to a type-safe implementation with void* keys
+ * and traceback_t* values for use by the heap profiler. We encapsulate this
+ * implementation in a wrapper specialized for use by the heap profiler, both to
+ * keep compilation fast (the cwisstables header is big) and to allow us to swap
+ * out the implementation if we want.
+ *
+ * Note that the HeapSamples tables will, in general, never free their backing
+ * memory unless we completely clear them. The table takes 17 bytes per entry: 8
+ * for the void* keys, 8 for the traceback* values, and 1 byte per entry for
+ * control metadata. Assuming a load factor target of ~50%, meaning our table
+ * has roughly twice as many slots as actual entries, then for our default
+ * maximum of 2^16 entries the table will be about 2MiB. A table this large
+ * would correspond to a program with a ~65GiB live heap with a 1MiB default
+ * sampling interval. Most of the memory usage of the profiler will come from
+ * the tracebacks themselves, which we _do_ free when we're done with them.
+ */
+#if defined(_WIN64) || defined(__x86_64__) || defined(__aarch64__)
+CWISS_DECLARE_FLAT_HASHMAP(HeapSamples, void*, traceback_t*);
+#else
+/* The default cwisstable hash relies on full-width 64-bit
+ * multiplication, which is really slow on 32-bit.
+ * For 32-bit, we define a custom hash function with reasonable quality.
+ * Derived from:
+ * https://github.com/Cyan4973/xxHash/blob/dev/doc/xxhash_spec.md#xxh32-algorithm-description.
+ *
+ * NOTE: cwisstable.h requires the hash function to return a size_t.
+ * On 32-bit platforms this is 32 bits, while the SwissTable design
+ * expects 64-bit hashes, with 7 of the bits used for metadata.
+ * So we get much lower entropy on 32-bit platforms.
+ */
+static size_t
+void_ptr_hash(const void* value)
+{
+#define PRIME32_1 0x9E3779B1U
+#define PRIME32_2 0x85EBCA77U
+#define PRIME32_3 0xC2B2AE3DU
+#define PRIME32_4 0x27D4EB2FU
+#define PRIME32_5 0x165667B1U
+
+    /* "Special case: input is less than 16 bytes".
+     * Here our seed is fixed at 0 so we elide it */
+    uint32_t acc = PRIME32_5;
+
+    /* "Input length" is the size of a pointer. */
+    acc = acc + sizeof(void*);
+
+    /* "Consume remaining input".
+     * Here we know our input is just 4 bytes, the size of a pointer */
+    uint32_t lane = *((uint32_t*)value);
+    acc += lane * PRIME32_3;
+    acc = (acc << 17) * PRIME32_4;
+
+    acc ^= (acc >> 15);
+    acc *= PRIME32_2;
+    acc ^= (acc >> 13);
+    acc *= PRIME32_3;
+    acc ^= (acc >> 16);
+    return acc;
+}
+CWISS_DECLARE_FLAT_MAP_POLICY(HeapSamples_policy32, void*, traceback_t*, (key_hash, void_ptr_hash));
+CWISS_DECLARE_HASHMAP_WITH(HeapSamples, void*, traceback_t*, HeapSamples_policy32);
+#endif
+
+typedef struct memalloc_heap_map_t
+{
+    HeapSamples map;
+} memalloc_heap_map_t;
+
+memalloc_heap_map_t*
+memalloc_heap_map_new()
+{
+    memalloc_heap_map_t* m = calloc(sizeof(memalloc_heap_map_t), 1);
+    m->map = HeapSamples_new(0);
+    return m;
+}
+
+size_t
+memalloc_heap_map_size(memalloc_heap_map_t* m)
+{
+    return HeapSamples_size(&m->map);
+}
+
+void
+memalloc_heap_map_insert(memalloc_heap_map_t* m, void* key, traceback_t* value)
+{
+    HeapSamples_Entry k = { .key = key, .val = value };
+    HeapSamples_Insert res = HeapSamples_insert(&m->map, &k);
+    if (!res.inserted) {
+        /* This shouldn't happen, but due to the current try-locking implementation
+         * we can hypothetically fail to remove a sample, leading to two entries
+         * with the same pointer. If we see this key already, then the allocation
+         * for the existing entry was freed. Replace it with the new one
+         */
+        HeapSamples_Entry* e = HeapSamples_Iter_get(&res.iter);
+        traceback_free(e->val);
+        e->val = value;
+    }
+}
+
+bool
+memalloc_heap_map_contains(memalloc_heap_map_t* m, void* key)
+{
+    return HeapSamples_contains(&m->map, &key);
+}
+
+traceback_t*
+memalloc_heap_map_remove(memalloc_heap_map_t* m, void* key)
+{
+    traceback_t* res = NULL;
+    HeapSamples_Iter it = HeapSamples_find(&m->map, &key);
+    HeapSamples_Entry* e = HeapSamples_Iter_get(&it);
+    if (e != NULL) {
+        res = e->val;
+        /* This erases the entry but won't shrink the table. */
+        HeapSamples_erase_at(it);
+    }
+    return res;
+}
+
+PyObject*
+memalloc_heap_map_export(memalloc_heap_map_t* m)
+{
+    PyObject* heap_list = PyList_New(HeapSamples_size(&m->map));
+    if (heap_list == NULL) {
+        return NULL;
+    }
+
+    int i = 0;
+    HeapSamples_CIter it = HeapSamples_citer(&m->map);
+    for (const HeapSamples_Entry* e = HeapSamples_CIter_get(&it); e != NULL; e = HeapSamples_CIter_next(&it)) {
+        traceback_t* tb = e->val;
+
+        PyObject* tb_and_size = PyTuple_New(2);
+        PyTuple_SET_ITEM(tb_and_size, 0, traceback_to_tuple(tb));
+        PyTuple_SET_ITEM(tb_and_size, 1, PyLong_FromSize_t(tb->size));
+        PyList_SET_ITEM(heap_list, i, tb_and_size);
+        i++;
+    }
+    return heap_list;
+}
+
+void
+memalloc_heap_map_destructive_copy(memalloc_heap_map_t* dst, memalloc_heap_map_t* src)
+{
+    HeapSamples_Iter it = HeapSamples_iter(&src->map);
+    for (const HeapSamples_Entry* e = HeapSamples_Iter_get(&it); e != NULL; e = HeapSamples_Iter_next(&it)) {
+        HeapSamples_insert(&dst->map, e);
+        /* Erasing the element doesn't free up the backing storage. This is
+         * probably fine; even at full capacity the table will probably only be
+         * ~1-2MiB
+         */
+        HeapSamples_erase_at(it);
+    }
+}
+
+void
+memalloc_heap_map_delete(memalloc_heap_map_t* m)
+{
+    HeapSamples_CIter it = HeapSamples_citer(&m->map);
+    for (const HeapSamples_Entry* e = HeapSamples_CIter_get(&it); e != NULL; e = HeapSamples_CIter_next(&it)) {
+        traceback_free(e->val);
+    }
+    HeapSamples_destroy(&m->map);
+    free(m);
+}
```
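A side note on the 32-bit fallback hash in the file above: its finalization steps are the standard XXH32 avalanche, which is what spreads the low entropy of nearby pointer values across all output bits. A standalone sketch (illustrative only, not part of the commit) of the same mixing sequence applied to arbitrary 32-bit values:

```c
#include <stdint.h>
#include <stdio.h>

/* Same constants as the XXH32-derived hash in the diff above. */
#define PRIME32_2 0x85EBCA77U
#define PRIME32_3 0xC2B2AE3DU
#define PRIME32_4 0x27D4EB2FU
#define PRIME32_5 0x165667B1U

/* Hash one 32-bit lane the way void_ptr_hash hashes a 32-bit pointer value. */
static uint32_t
hash32_lane(uint32_t lane)
{
    uint32_t acc = PRIME32_5 + sizeof(uint32_t); /* seed 0 elided, "length" = 4 */
    acc += lane * PRIME32_3;
    acc = (acc << 17) * PRIME32_4;
    /* Final avalanche: spread entropy across all output bits. */
    acc ^= acc >> 15;
    acc *= PRIME32_2;
    acc ^= acc >> 13;
    acc *= PRIME32_3;
    acc ^= acc >> 16;
    return acc;
}

int main(void)
{
    /* Nearby "addresses" should hash to very different values. */
    for (uint32_t p = 0x1000; p < 0x1000 + 4 * 16; p += 16) {
        printf("0x%08x -> 0x%08x\n", p, hash32_lane(p));
    }
    return 0;
}
```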
