Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Use a principled, and consistent, implementation of freelists. #89

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
markshannon opened this issue Sep 28, 2021 · 11 comments
Closed

Use a principled, and consistent, implementation of freelists. #89

markshannon opened this issue Sep 28, 2021 · 11 comments

Comments

@markshannon
Copy link
Member

We currently have many ad-hoc per class free-lists.

We should consider changing freelists from per-class to per-size.
My informal experiments show that currently about 50% of allocations are served from free lists, whereas with per-size freelists, this increased to about 90%.
With tuning, we should expect a hit rate well over 90%.

Using free lists for blocks of memory, instead of objects, means that there are no interactions between the free lists and the cycle GC, simplifying both. There is a small additional cost of re-assigning the class, but the cost of this will be negligible in most cases.

Design principles

  • Must support tracemalloc. That is, if tracemalloc is enabled, then no allocations should use free lists.
  • There should be one free list per size class.
  • The mapping from size to size class should be a small, pure function, ideally a fast one.
  • The mapping from size class to free list should be a small, pure function, ideally a fast one.
  • The functions for determining if a free list is full or empty should fast.
  • The procedure for pushing to and popping from a freelist should be a small and fast (this cannot be pure, since its purpose is a side effect).

With the above, allocation becomes:

void *allocate(size_t s)
{
    FreeList *list = size_class_to_free_list(size_to_size_class(s));
    if (!is_empty(list)) {
        /* fast path */
       return free_list_pop(list);
    }
    /* slow path */
    ...
}

The functions mentioned above must be small and pure so that:

  • If the input is a compile-time constant, then the whole expression is a compile-time constant.
  • If the input is not a compile-time constant, then the whole expression can be inlined if the compiler thinks it beneficial to do so.

In practice, we will probably inline most of the above function calls, it just helps to think of the parts separately when designing.

Implementation of the freelist.

One possible design of the free lists is:

Each free list consists of three elements.

  • A pointer to the head of the list (or NULL if empty)
  • An integer representing the maximum capacity. This should be determined empirically, but is likely to be larger for smaller sizes.
  • An integer representing the remaining space.

For alignment reasons the integers should be half the size of a pointer, which limits the capacity to 64k on a 32 bit machine, which is plenty.

The actual list is implemented as a linked list through the blocks of memory themselves, treating the first word of the memory as a pointer to the next entry.

  • tracemalloc support: By setting the head to NULL and the remaining space to zero, no allocations will be from the free list. Should tracemalloc be turned off, the remaining space can be set back to the capacity.
  • Functions for determining if the list is full or empty are fast: freelist->remaining == 0 and freelist->head == NULL, respectively.
  • The mapping from size class to free list can be made fast and simple, by placing the freelists in an array.
  • Pushing is simple enough, and with few dependent loads: new_entry->next = freelist->head; freelist->head = new_entry; freelist->remaining--;
  • Likewise popping: result = freelist->head; freelist->head = result->next; freelist->remaining++;

By putting the free lists in an array, the mapping from size class to free list is very simple.

Behavior when free list is full or empty; interaction with the malloc implementation.

This simplest option is to call malloc when the list is empty and to call free when the list is full.
Unfortunately this leads to fragmentation and performs poorly when repeatedly allocating or deallocating.

To avoid fragmentation, we should clear the free list when it is full and we need to free another object.
To avoid trashing on alternating malloc/frees we should half-fill the list when it is empty and we need to allocate an object.
Doing bulk free/mallocs reduces the number of free list misses considerably and keeps fragmentation down.

Ideally the malloc implementation would have functions for freeing and allocating multiple objects of the same size, which they generally don't. We could add that feature to PyMalloc.

@methane
Copy link

methane commented Sep 28, 2021

FWIW, I had tried similar idea but I couldn't optimize it enough.
methane/cpython#3 is the suspended branch.

@kmod
Copy link

kmod commented Sep 28, 2021

Could this be done as a feature of pymalloc? pymalloc already has a number of the features you're looking for, particularly mapping from size to size-class-specific metadata.

@markshannon
Copy link
Member Author

@methane how did it compare? As long as it was no slower, it would be worth it just for the reduction in code size.

@kmod, it is a bit arbitrary whether the free lists go inside the allocator or on top of it.
Putting it in the allocator makes sense, as the allocator knows the size of objects, so we can keep the interface to free().
However, we want to make small incremental improvements, so that would be a later step.

@ericsnowcurrently
Copy link
Collaborator

Cool idea!

A quick look at Include/internal/pycore_interp.h shows the following freelists:

  • PyFloatObject
  • PyTupleObject
  • PyListObject
  • PyDictObject
  • struct _PyAsyncGenWrappedValue
  • struct PyAsyncGenASend
  • PyContext
  • PyBaseExceptionObject

So all of these would change to use the new mechanism, right? What is the likelihood that any of these share a size?

Other questions:

  • What fundamental improvement is this over ditching freelists completely and using the regular allocator? Avoiding fragmentation for a common case?
  • How do we decide the best size for freelists? (from what I understand, the current sizes have been finely tuned)
  • What would be the criteria for whether or not a type should use a freelist?
  • What change in performance might we see when the number of types in a size class (ergo sharing a freelist) changes?
  • How does this impact runtime finalization?
  • Subclasses won't use freelists, right?

@methane
Copy link

methane commented Oct 20, 2021

I updated my legacy global-freelist with current main branch.
See python/cpython#29089

@tiran
Copy link

tiran commented Oct 20, 2021

Inspired from Sam's nogil effort, I started to play with https://github.com/microsoft/mimalloc as replacement for libc's malloc and pymalloc. It comes with builtin freelist support and freelist sharding. If we move to to mimalloc, then we get lock-free freelists for free (three times free for free)!

Pablo ran benchmarks for me:

The first case is slightly slower than vanilla Python. It is a surprisingly good result for a first attempt without any tuning.

@serhiy-storchaka
Copy link

Is not it already done at low level in obmalloc?

@methane
Copy link

methane commented Oct 20, 2021

Yes. obmalloc has "freelist" in pool. It is very similar to "freelist" in mimalloc.

    block *freeblock;                   /* pool's free list head         */

So "mimalloc has freelist" is not enough reason to say "Python don't need freelist before malloc/free".
If malloc/free in mimalloc is very fast, "freelist in Python" will be just overhead.

@markshannon
Copy link
Member Author

markshannon commented Oct 22, 2021

Prototype version of a more principled approach to free lists shows a 1%speed up

@markshannon
Copy link
Member Author

Extending this beyond ints and floats is a pain.
The fundamental problem is the C model of malloc and free. The memory allocator must know the size of a memory block, but there is no way to get at that information, which is a bit frustrating.

Adopting an allocator, such as mimalloc, and exposing the freelists would be an alternative way to do this. The reason that would boost performance, is the ability to inline allocations and frees. We don't need to worry about race conditions thanks to the GIL.

@JunyiXie
Copy link

Inspired from Sam's nogil effort, I started to play with https://github.com/microsoft/mimalloc as replacement for libc's malloc and pymalloc. It comes with builtin freelist support and freelist sharding. If we move to to mimalloc, then we get lock-free freelists for free (three times free for free)!

Pablo ran benchmarks for me:

The first case is slightly slower than vanilla Python. It is a surprisingly good result for a first attempt without any tuning.

How much impact that replace pymalloc with mimalloc on the binary size?

@faster-cpython faster-cpython locked and limited conversation to collaborators Dec 2, 2021
@gramster gramster moved this to Todo in Fancy CPython Board Jan 10, 2022
@gramster gramster moved this from Todo to Other in Fancy CPython Board Jan 10, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants