Deferred reference counts.

CPython uses a combination of naive reference counting and a stack machine.
This results in a *lot* of reference counting operations that have a major impact on the performance of the JIT and interpreter.

We can reduce the cost of reference counting hugely by using deferred reference counting.

With deferred reference counting, we do not explicitly count references from the active part of the frame stack (in both local variables and the evaluation stack). Instead, we only count those references during GC.

Something like 70% of reference count operations occur in the interpreter. This fraction will vary, but if we start moving code from C to Python for performance and maintainability reasons, it will likely increase.

In order to implement deferred reference counting, we will need new C types for the stack references; [some work on this](https://github.com/python/cpython/pull/118330) has already been done.


### Reference counts

Currently we maintain an exact reference count for each object in that object's header. This is the *counted* reference count:
`counted RC == true RC`
With deferred reference counts some references will not be counted in the object's header, but will be *deferred*:
`counted RC + deferred RC == true RC`
Note that, since no reference count can be negative, the true reference count can only be zero when both the deferred RC and the counted RC are zero.

This means that we need a way to keep a handle to all objects whose counted RC reaches zero, so that we can collect them when we know that all deferred RCs are zero.
We will probably implement this using what is known as a Zero Count Table (ZCT). Any object whose counted RC reaches zero, or which is assigned to a stack variable when created, will be added to the ZCT.
During GC, when all deferred RCs are zero, any objects in the ZCT with a counted RC of zero can be reclaimed.
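
To make the bookkeeping concrete, here is a minimal sketch of a ZCT. It is a toy model, not CPython code: `ToyObject`, `ToyZCT`, `heap_incref`, `heap_decref`, `zct_add` and `zct_sweep` are all hypothetical names, and a fixed-size array stands in for whatever structure the real table would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy object: only the counted RC lives in the header. */
typedef struct {
    size_t counted_rc;   /* references from the heap */
    bool in_zct;         /* already recorded in the ZCT? */
    bool reclaimed;      /* stands in for actually freeing the object */
} ToyObject;

#define ZCT_CAPACITY 64

typedef struct {
    ToyObject *entries[ZCT_CAPACITY];
    size_t len;
} ToyZCT;

/* Record an object whose counted RC is (or starts out as) zero. */
static void zct_add(ToyZCT *zct, ToyObject *obj) {
    if (obj->in_zct) {
        return;
    }
    assert(zct->len < ZCT_CAPACITY);
    zct->entries[zct->len++] = obj;
    obj->in_zct = true;
}

static void heap_incref(ToyObject *obj) {
    obj->counted_rc++;
}

static void heap_decref(ToyZCT *zct, ToyObject *obj) {
    assert(obj->counted_rc > 0);
    if (--obj->counted_rc == 0) {
        /* Cannot free yet: the deferred RC may still be non-zero. */
        zct_add(zct, obj);
    }
}

/* To be called only once all deferred RCs are zero, so that
 * counted RC == true RC for every object. Returns the number reclaimed. */
static size_t zct_sweep(ToyZCT *zct) {
    size_t reclaimed = 0;
    for (size_t i = 0; i < zct->len; i++) {
        ToyObject *obj = zct->entries[i];
        obj->in_zct = false;
        if (obj->counted_rc == 0) {
            obj->reclaimed = true;
            reclaimed++;
        }
    }
    zct->len = 0;
    return reclaimed;
}
```

Objects that picked up a counted reference after entering the table simply drop out of it during the sweep.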

### Prompt reclamation.
One important advantage of reference counting is prompt reclamation; there is not a long pause between an object becoming unreachable and it getting reclaimed.
 
In practice, it is not prompt reclamation of objects that matters but prompt reclamation of resources, primarily memory.
Rather than tracking the number of objects allocated to determine when GC should occur, we should track the total memory allocated.
If we are allocating multi-megabyte objects, GC should occur very frequently.
 
### The algorithm

#### Allocation

Upon allocation of an object:
* Decrease the nursery size by the size of the object
* If the nursery size drops to zero or below, then schedule GC
* If the reference is to be deferred (such that the counted RC is zero) then add the new object to the ZCT
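
The allocation steps above could look roughly like this. This is a sketch under assumptions: `ToyHeap`, `ToyHeader`, `toy_alloc` and the 1 MiB `NURSERY_MAX` are made up for illustration, and a counter stands in for the real ZCT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NURSERY_MAX ((ptrdiff_t)1 << 20)   /* hypothetical 1 MiB budget */

typedef struct {
    ptrdiff_t nursery_remaining;  /* signed, as it may drop below zero */
    bool gc_scheduled;
    size_t zct_len;               /* stand-in for a real ZCT */
} ToyHeap;

typedef struct {
    size_t counted_rc;
    size_t size;
} ToyHeader;

/* Allocate `size` bytes of payload. `deferred` says whether the first
 * reference to the object will be a (deferred) stack reference. */
static ToyHeader *toy_alloc(ToyHeap *heap, size_t size, bool deferred) {
    assert(heap != NULL);
    ToyHeader *obj = malloc(sizeof(ToyHeader) + size);
    if (obj == NULL) {
        return NULL;
    }
    obj->size = size;

    /* Charge the allocation against the nursery, by bytes, not by object. */
    heap->nursery_remaining -= (ptrdiff_t)size;
    if (heap->nursery_remaining <= 0) {
        heap->gc_scheduled = true;
    }

    if (deferred) {
        obj->counted_rc = 0;
        heap->zct_len++;          /* "add the new object to the ZCT" */
    }
    else {
        obj->counted_rc = 1;
    }
    return obj;
}
```

Because the budget is measured in bytes, a single multi-megabyte allocation is enough to schedule a collection.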

#### GC

A GC collection should do the following:
1. Make all deferred references immediate by scanning all stacks and any generators/coroutines with deferred references.
2. Deallocate all objects in the ZCT with a counted RC of zero.
3. Perform incremental GC as normal, removing any collected objects from the ZCT.
4. Set the nursery size back to maximum.
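
As a rough illustration of how these four steps fit together, here is a toy collector. Everything in it (`ToyVM`, `ToyObj`, `toy_collect`, the flat single stack) is a simplifying assumption, and step 3 is left as a comment because the incremental collector itself is out of scope here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 16
#define TOY_NURSERY_MAX 4096L

typedef struct {
    size_t counted_rc;
    bool freed;               /* stands in for deallocation */
} ToyObj;

typedef struct {
    ToyObj *stack[TOY_MAX];   /* references held by the frame stack */
    size_t stack_len;
    bool stack_is_deferred;   /* are those references still uncounted? */
    ToyObj *zct[TOY_MAX];
    size_t zct_len;
    long nursery_remaining;
} ToyVM;

/* Returns the number of objects reclaimed from the ZCT. */
static size_t toy_collect(ToyVM *vm) {
    assert(vm->stack_len <= TOY_MAX && vm->zct_len <= TOY_MAX);

    /* 1. Make deferred references immediate by counting them. */
    if (vm->stack_is_deferred) {
        for (size_t i = 0; i < vm->stack_len; i++) {
            vm->stack[i]->counted_rc++;
        }
        vm->stack_is_deferred = false;
    }

    /* 2. Counted RC now equals true RC, so zero means unreachable. */
    size_t freed = 0;
    for (size_t i = 0; i < vm->zct_len; i++) {
        ToyObj *obj = vm->zct[i];
        if (obj->counted_rc == 0) {
            obj->freed = true;
            freed++;
        }
    }
    vm->zct_len = 0;

    /* 3. The incremental (cycle) GC would run here; omitted in this sketch. */

    /* 4. Reset the nursery budget. */
    vm->nursery_remaining = TOY_NURSERY_MAX;
    return freed;
}
```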

#### Details

* Any thread that calls into C code or suspends execution must store the stack pointer so that the GC can scan the stack.
* It must be possible to add objects to, and remove objects from, the ZCT very quickly.
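
One possible way to get constant-time insertion and removal is to store each object's table index in the object itself and remove by swapping with the last entry. This is only one candidate design; `ToyObj`, `ToyZCT`, `zct_add`, `zct_remove` and the fixed capacity are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ZCT_NONE ((size_t)-1)   /* marks "not in the table" */
#define ZCT_CAP 128

typedef struct {
    size_t zct_index;           /* position in the table, or ZCT_NONE */
} ToyObj;

typedef struct {
    ToyObj *slots[ZCT_CAP];
    size_t len;
} ToyZCT;

/* O(1): append and remember where the object went. */
static void zct_add(ToyZCT *z, ToyObj *o) {
    if (o->zct_index != ZCT_NONE) {
        return;
    }
    assert(z->len < ZCT_CAP);
    o->zct_index = z->len;
    z->slots[z->len++] = o;
}

/* O(1): move the last entry into the vacated slot. */
static void zct_remove(ToyZCT *z, ToyObj *o) {
    size_t i = o->zct_index;
    if (i == ZCT_NONE) {
        return;
    }
    ToyObj *last = z->slots[--z->len];
    z->slots[i] = last;
    last->zct_index = i;
    o->zct_index = ZCT_NONE;
}
```

The obvious cost is an extra word per object; a bit in the header plus lazy removal during the sweep would trade that space for a slower sweep.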

#### Handling references

We will need to distinguish between heap references and stack references.
Stack references to an object contribute to the deferred RC of that object.
Heap references to an object contribute to the counted RC of that object.

For all code except the GC, we can use the C compiler to help us out.
We can define two opaque structs for stack and heap references:

```C
/* Stack reference */
typedef struct _stackref  PySRef;

/* Heap reference */
typedef struct _heapref PyHRef;
```

Because we cannot cast a `PyHRef` to a `PySRef` or vice-versa, all conversions must go through function calls
(provided we are disciplined enough to not extract the bits directly).

The function calls can correctly adjust the reference counts, putting objects in the ZCT if necessary.
In debug mode we can use a bit as a dynamic marker to double-check that we are doing things correctly.
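
Here is a small sketch of how such conversion functions might adjust the counts. It uses toy single-member structs (`ToySRef`, `ToyHRef`) rather than the real opaque types, and the function names and the global `toy_zct_len` counter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t counted_rc;
    bool in_zct;
} ToyObj;

/* Two distinct struct types: the compiler rejects passing one where the
 * other is expected, even though both just wrap a pointer. */
typedef struct { ToyObj *ptr; } ToySRef;   /* deferred, not counted */
typedef struct { ToyObj *ptr; } ToyHRef;   /* counted in the header */

static size_t toy_zct_len = 0;             /* stand-in for a real ZCT */

static void toy_zct_add(ToyObj *o) {
    if (!o->in_zct) {
        o->in_zct = true;
        toy_zct_len++;
    }
}

/* Make a new heap reference from a stack reference: counted RC goes up. */
static ToyHRef toy_sref_to_href_new(ToySRef s) {
    s.ptr->counted_rc++;
    return (ToyHRef){ s.ptr };
}

/* Turn a heap reference into a stack reference, consuming the heap one.
 * The object stays alive through the deferred reference, so if its counted
 * RC drops to zero it must be recorded in the ZCT rather than freed. */
static ToySRef toy_href_to_sref_steal(ToyHRef h) {
    assert(h.ptr->counted_rc > 0);
    if (--h.ptr->counted_rc == 0) {
        toy_zct_add(h.ptr);
    }
    return (ToySRef){ h.ptr };
}

/* Closing a stack reference touches no count at all. */
static void toy_sref_close(ToySRef s) {
    (void)s;
}
```

Passing a `ToyHRef` to `toy_sref_close` is a compile-time error, which is the kind of help from the C compiler described above.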

All interpreter frames, including generator and coroutine frames, must contain all deferred RCs or all immediate RCs. Executing frames must contain deferred RCs. The frame will need a bit to indicate whether references are deferred or immediate.

### Lowering the cost of stack scanning

At every GC pause, which might be frequent, we need to make sure that the deferred reference count of all objects is zero.
To do that we need to scan the entirety of all stacks. That could be expensive.
To avoid having to scan the entire stack, we treat the lower part (furthest away from the current frame) as heap references and the top part (including the current frame) as stack references. The current frame must always be deferred, so we insert a special frame between the heap and stack parts to convert when a `RETURN` or `YIELD` would otherwise start executing a frame of heap references.

As we can rely on the special frame to convert from heap to stack references lazily, we don't need to perform that conversion in the GC.
The GC does need to convert stack to heap references, so at the end of a GC collection, all thread stacks will have a special frame as their current frame.
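
The following toy model shows why this bounds the scanning work. It is a sketch only: `ToyThread`, `first_deferred`, `toy_gc_scan`, `toy_enter_top` (standing in for the special frame) and `toy_return` are hypothetical, and ZCT updates on decrement are left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 4
#define TOY_DEPTH 8

typedef struct { size_t counted_rc; } ToyObj;

typedef struct {
    ToyObj *slots[TOY_SLOTS];
    size_t nslots;
    bool deferred;            /* are this frame's references uncounted? */
} ToyFrame;

typedef struct {
    ToyFrame frames[TOY_DEPTH];
    size_t depth;
    size_t first_deferred;    /* frames below this index hold heap references */
    size_t slots_scanned;     /* instrumentation for the example */
} ToyThread;

static void frame_to_counted(ToyFrame *f) {
    for (size_t i = 0; i < f->nslots; i++) f->slots[i]->counted_rc++;
    f->deferred = false;
}

static void frame_to_deferred(ToyFrame *f) {
    /* A real version would add objects reaching zero to the ZCT. */
    for (size_t i = 0; i < f->nslots; i++) f->slots[i]->counted_rc--;
    f->deferred = true;
}

/* GC: only the deferred top part of the stack needs scanning. */
static void toy_gc_scan(ToyThread *t) {
    for (size_t i = t->first_deferred; i < t->depth; i++) {
        frame_to_counted(&t->frames[i]);
        t->slots_scanned += t->frames[i].nslots;
    }
    t->first_deferred = t->depth;
}

/* What the special frame does: lazily convert the frame about to run. */
static void toy_enter_top(ToyThread *t) {
    ToyFrame *top = &t->frames[t->depth - 1];
    if (!top->deferred) {
        frame_to_deferred(top);
        t->first_deferred = t->depth - 1;
    }
}

static void toy_return(ToyThread *t) {
    assert(t->depth > 1 && t->frames[t->depth - 1].deferred);
    t->depth--;               /* popped references were never counted */
    if (t->first_deferred > t->depth) t->first_deferred = t->depth;
    toy_enter_top(t);
}
```

With a deep stack and frequent collections, each GC after the first only rescans the frames that have actually executed since the previous one.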

### Structures

In order to get the most help from the C compiler converting plain `PyObject *` pointers into references suitable for supporting deferred reference counting and tagged ints, we need to define some opaque structs.

We should use the dup/close model of HPy for references, as it allows automatic finding of refcount errors which will be a boon for development.

#### The API

```C
/* Heap references: duplicate and close. */
inline PyHRef PyHRef_Dup(PyHRef);
inline void PyHRef_Close(PyHRef);

inline PyObject *PyHRef_To_PyObject_New(PyHRef);

/* Convert and consume the heap reference. */
static inline PyObject *PyHRef_To_PyObject_Steal(PyHRef href) {
    PyObject *res = PyHRef_To_PyObject_New(href);
    PyHRef_Close(href);
    return res;
}

/* Stack references: clone and close. */
inline PySRef PySRef_Clone(PySRef);
inline void PySRef_Close(PySRef);

inline PyObject *PySRef_To_PyObject_New(PySRef);

/* Convert and consume the stack reference. */
static inline PyObject *PySRef_To_PyObject_Steal(PySRef sref) {
    PyObject *res = PySRef_To_PyObject_New(sref);
    PySRef_Close(sref);
    return res;
}

/* Conversions from a plain PyObject pointer. */
inline PySRef PyObject_To_SRef_New(PyObject *);

static inline PySRef PyObject_To_SRef_Steal(PyObject *obj) {
    PySRef res = PyObject_To_SRef_New(obj);
    Py_DECREF(obj);
    return res;
}

inline PyHRef PyObject_To_HRef_New(PyObject *);

static inline PyHRef PyObject_To_HRef_Steal(PyObject *obj) {
    PyHRef res = PyObject_To_HRef_New(obj);
    Py_DECREF(obj);
    return res;
}
```
All the `Steal` variants are defined as above, but may be implemented more efficiently.

### Generators and coroutines

Generators and coroutines are executable objects containing references to other objects, like frames, but they are not on the frame stack; they are on the heap.

Here are three ways we can handle this:
1. All references in the generator's frame are stack references.
    This keeps execution and the compiler simple, as generator frames are no different from normal frames. We do, however, need to track all generators so that the GC can convert the references when needed. With many live coroutines, this could get expensive.
2. All references in the generator's frame are heap references.
    This requires us to duplicate all instructions, which would be infeasible, or have the compiler insert additional reference count instructions where needed. With reasonable static analysis, this could perform reasonably well, but adds a lot of complexity to the compiler.
3. All local variables are heap references, but all evaluation stack entries are stack references. This requires a special version of `STORE_FAST`, but other instructions are unchanged. The compiler will need to either make sure the stack is empty when suspended, *or* make sure that all values on the stack are also present in a (counted) local variable. The latter is probably more efficient.

Of these, I think option 3 is the best. It is reasonably efficient, and can be implemented and tested prior to the rest of the work on deferred references.

### Tagging

Since references on the stack can be either stack references or heap references, it will be relatively easy to get the type wrong.
To help prevent that we can tag references.
In theory no tags should be necessary, and we could just do this for debug builds. However, if we are to support tagged ints, we will need tagging anyway, so we might as well use it for all code.

For consistency with tagged ints, we want 0 to mean "not a heap reference", i.e. stack reference (or tagged int), leading to the following tagging scheme:

 Tag | Meaning
--- | --- 
00 | Reserved
01 | Stack reference pointer (value = ptr+1)
10 | Reserved
11 | Heap reference pointer (value = ptr-1)

With tagged ints, we get this scheme:

 Tag | Meaning
--- | --- 
00 | Unboxed int
01 | Stack reference pointer (value = ptr+1)
10 | Reserved
11 | Heap reference pointer (value = ptr-1)
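
A sketch of encoding and decoding under this scheme follows. The helper names are hypothetical, pointers are assumed to be at least 4-byte aligned, and the int encoding (value shifted left by two bits) is an assumption, since the tables above only fix the tag bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TAG_MASK  ((uintptr_t)3)
#define TOY_TAG_INT   ((uintptr_t)0)   /* 00 */
#define TOY_TAG_STACK ((uintptr_t)1)   /* 01 */
#define TOY_TAG_HEAP  ((uintptr_t)3)   /* 11 */

/* Stack reference: value = ptr + 1, giving low bits 01. */
static uintptr_t toy_tag_stack(void *ptr) {
    uintptr_t bits = (uintptr_t)ptr;
    assert((bits & TOY_TAG_MASK) == 0);
    return bits + 1;
}

/* Heap reference: value = ptr - 1, giving low bits 11. */
static uintptr_t toy_tag_heap(void *ptr) {
    uintptr_t bits = (uintptr_t)ptr;
    assert((bits & TOY_TAG_MASK) == 0);
    return bits - 1;
}

static bool toy_is_heap(uintptr_t ref) {
    return (ref & TOY_TAG_MASK) == TOY_TAG_HEAP;
}

/* Recover the pointer from either kind of reference, NULL otherwise. */
static void *toy_untag_ptr(uintptr_t ref) {
    switch (ref & TOY_TAG_MASK) {
        case 1: return (void *)(ref - 1);
        case 3: return (void *)(ref + 1);
        default: return NULL;          /* unboxed int or reserved tag */
    }
}

/* Assumed int encoding: multiply by 4 so the low two bits are 00.
 * Unsigned arithmetic keeps this well defined for negative values. */
static uintptr_t toy_tag_int(intptr_t value) {
    return (uintptr_t)value * 4;
}

/* Relies on the usual two's complement conversion back to intptr_t. */
static intptr_t toy_untag_int(uintptr_t ref) {
    assert((ref & TOY_TAG_MASK) == TOY_TAG_INT);
    return (intptr_t)ref / 4;
}
```

In a debug build, every function taking a stack or heap reference could assert the expected tag before untagging.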

