Commit 3623aaa

Add file describing how to add or modify specialized families of instructions. (GH-26954)
1 parent dd3adc0 commit 3623aaa

2 files changed: +135 -0 lines changed

Python/adaptive.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Adding or extending a family of adaptive instructions.

## Families of instructions

The core part of PEP 659 (specializing adaptive interpreter) is the families
of instructions that perform the adaptive specialization.

A family of instructions has the following fundamental properties:

* It corresponds to a single instruction in the code
  generated by the bytecode compiler.
* It has a single adaptive instruction that records an execution count and,
  at regular intervals, attempts to specialize itself. If not specializing,
  it executes the non-adaptive instruction.
* It has at least one specialized form of the instruction that is tailored
  for a particular value or set of values at runtime.
* All members of the family have access to the same number of cache entries.
  Individual family members do not need to use all of the entries.

The current implementation also requires the following,
although these are not fundamental and may change:

* If a family uses one or more entries, then the first entry must be a
  `_PyAdaptiveEntry` entry.
* If a family uses no cache entries, then the `oparg` is used as the
  counter for the adaptive instruction.
* All instruction names should start with the name of the non-adaptive
  instruction.
* The adaptive instruction should end in `_ADAPTIVE`.
* Specialized forms should have names describing their specialization.
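
The naming rules above are mechanical enough to check automatically. The following is a minimal sketch in Python, not part of CPython; the helper `check_family_names` is hypothetical and only encodes the two prefix and suffix rules listed above.

```python
def check_family_names(base, adaptive, specialized):
    """Return a list of naming problems; an empty list means the names conform.

    Hypothetical helper: it only checks the prefix/suffix conventions
    described in this document, nothing else.
    """
    problems = []
    # Every member of the family should start with the base instruction name.
    if not adaptive.startswith(base):
        problems.append(f"{adaptive} does not start with {base}")
    # The adaptive instruction should end in _ADAPTIVE.
    if not adaptive.endswith("_ADAPTIVE"):
        problems.append(f"{adaptive} does not end in _ADAPTIVE")
    for name in specialized:
        if not name.startswith(base):
            problems.append(f"{name} does not start with {base}")
    return problems


family_ok = check_family_names(
    "LOAD_GLOBAL",
    "LOAD_GLOBAL_ADAPTIVE",
    ["LOAD_GLOBAL_MODULE", "LOAD_GLOBAL_BUILTIN"],
)
```

Whether a specialized name is descriptive enough remains a judgement call that a check like this cannot make.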

## Example family

The `LOAD_GLOBAL` instruction (in Python/ceval.c) already has an adaptive
family that serves as a relatively simple example.

The `LOAD_GLOBAL_ADAPTIVE` instruction performs adaptive specialization,
calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.

There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
which is specialized for global variables in the module, and
`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
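
To make the control flow of this family concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the class `ToyLoadGlobal`, its `interval` parameter and the dictionary-membership guards are all invented for illustration, and the real instructions in Python/ceval.c use cache entries and version checks instead.

```python
class ToyLoadGlobal:
    """Toy model of a single LOAD_GLOBAL site; not CPython's implementation."""

    def __init__(self, name, globals_ns, builtins_ns, interval=8):
        self.name = name
        self.globals_ns = globals_ns
        self.builtins_ns = builtins_ns
        self.interval = interval          # hypothetical countdown length
        self.counter = interval
        self.form = "LOAD_GLOBAL_ADAPTIVE"

    def _specialize(self):
        # Stand-in for _Py_Specialize_LoadGlobal(): pick a specialized form.
        if self.name in self.globals_ns:
            self.form = "LOAD_GLOBAL_MODULE"
        elif self.name in self.builtins_ns:
            self.form = "LOAD_GLOBAL_BUILTIN"
        else:
            self.counter = self.interval  # nothing fits; try again later

    def _base(self):
        # The non-adaptive instruction: globals first, then builtins.
        if self.name in self.globals_ns:
            return self.globals_ns[self.name]
        return self.builtins_ns[self.name]

    def execute(self):
        if self.form == "LOAD_GLOBAL_ADAPTIVE":
            if self.counter == 0:
                self._specialize()
            else:
                self.counter -= 1
            return self._base()
        if self.form == "LOAD_GLOBAL_MODULE" and self.name in self.globals_ns:
            return self.globals_ns[self.name]
        if (self.form == "LOAD_GLOBAL_BUILTIN"
                and self.name not in self.globals_ns
                and self.name in self.builtins_ns):
            return self.builtins_ns[self.name]
        # A guard failed: de-optimize and fall back to the base behaviour.
        self.form = "LOAD_GLOBAL_ADAPTIVE"
        self.counter = self.interval
        return self._base()
```

With `interval=2`, a site that loads `len` from the builtins stays adaptive for two executions, specializes on the third, and de-optimizes again if a global named `len` later shadows the builtin.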

## Performance analysis

The benefit of a specialization can be assessed with the following formula:
`Tbase/Tadaptive`.

Here `Tbase` is the mean time to execute the base instruction,
and `Tadaptive` is the mean time to execute the specialized and adaptive forms.

`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`

`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
the number of times that instruction is executed.
`Tmiss` is the time to process a miss, including de-optimization
and the time to execute the base instruction.
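
The formula can be evaluated directly once timings and counts have been measured. The following Python sketch transcribes it; the function names and the numbers in the example call are made up for illustration and are not measurements.

```python
def t_adaptive(times, counts, t_miss, n_miss):
    """Mean time per execution across the family, including misses.

    times[i] is Ti and counts[i] is Ni for the i-th family member.
    """
    total_time = sum(t * n for t, n in zip(times, counts)) + t_miss * n_miss
    total_executions = sum(counts) + n_miss
    return total_time / total_executions


def speedup(t_base, times, counts, t_miss, n_miss):
    """Tbase/Tadaptive: values above 1 mean the specialization pays off."""
    return t_base / t_adaptive(times, counts, t_miss, n_miss)


# Illustrative numbers only: two family members, one miss.
example_mean = t_adaptive([1.0, 2.0], [3, 1], t_miss=10.0, n_miss=1)
example_speedup = speedup(6.0, [1.0, 2.0], [3, 1], t_miss=10.0, n_miss=1)
```

Note how a single expensive miss dominates the mean in this example, which is why the miss rate matters as much as the speed of the specialized forms.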

The ideal situation is where misses are rare and the specialized
forms are much faster than the base instruction.
`LOAD_GLOBAL` is near ideal, with `Nmiss/sum(Ni) ≈ 0`,
in which case we have `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
we would expect the specialization of `LOAD_GLOBAL` to be profitable.

## Design considerations

While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
low for all specialized instructions and `Nmiss` as low as possible.

Keeping `Nmiss` low means that there should be specializations for almost
all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
keeping `Ti` low which means minimizing branches and dependent memory
accesses (pointer chasing). These two objectives may be in conflict,
requiring judgement and experimentation to design the family of instructions.

### Gathering data

Before choosing how to specialize an instruction, it is important to gather
some data. What are the patterns of usage of the base instruction?
Data can best be gathered by instrumenting the interpreter. Since a
specialization function and adaptive instruction are going to be required,
instrumentation can most easily be added in the specialization function.
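
As an illustration of the kind of data worth collecting, the Python sketch below counts which category each specialization attempt falls into. Everything in it is hypothetical (`specialization_stats`, `classify_load_global`, `specialize_with_stats`); real instrumentation would live in the C specialization function.

```python
from collections import Counter

# Hypothetical statistics table, keyed by the kind of value observed.
specialization_stats = Counter()


def classify_load_global(name, globals_ns, builtins_ns):
    """Describe where a global name would be found."""
    if name in globals_ns:
        return "module"
    if name in builtins_ns:
        return "builtin"
    return "unbound"


def specialize_with_stats(name, globals_ns, builtins_ns):
    """Stand-in for a specialization function that records what it sees."""
    kind = classify_load_global(name, globals_ns, builtins_ns)
    specialization_stats[kind] += 1
    return kind


for observed in ["len", "x", "len", "missing"]:
    specialize_with_stats(observed, {"x": 1}, {"len": len})
```

A histogram like this shows which cases are common enough to deserve their own specialized instruction and which would mostly add misses.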

### Choice of specializations

The performance of the specializing adaptive interpreter relies on the
quality of specialization and keeping the overhead of specialization low.

Specialized instructions must be fast. In order to be fast,
specialized instructions should be tailored for a particular
set of values that allows them to:
1. Verify that the incoming value is part of that set with low overhead.
2. Perform the operation quickly.

This requires that the set of values is chosen such that membership can be
tested quickly and that membership is sufficient to allow the operation to
be performed quickly.

For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
dictionaries that have keys with the expected version.

This can be tested quickly:
* `globals->keys->dk_version == expected_version`

and the operation can be performed quickly:
* `value = globals->keys->entries[index].value`.
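
The reason a version check is sufficient can be seen in a small Python model. This is a sketch, not CPython's dictionary: `ToyKeys`, `specialize_module_load`, `load_global_module` and the `MISS` sentinel are invented names, and the model assumes the version changes only when the key layout changes.

```python
MISS = object()  # sentinel meaning "guard failed, use the base instruction"


class ToyKeys:
    """Stand-in for a keys object: a version tag plus an entries array."""

    def __init__(self):
        self.version = 0
        self.names = []
        self.values = []

    def set(self, name, value):
        if name in self.names:
            # Existing key: the layout, and therefore the version, is unchanged.
            self.values[self.names.index(name)] = value
        else:
            self.names.append(name)
            self.values.append(value)
            self.version += 1  # layout changed, cached indexes are now suspect


def specialize_module_load(keys, name):
    """Slow lookup done once; returns the cache (expected_version, index)."""
    return keys.version, keys.names.index(name)


def load_global_module(keys, cache):
    expected_version, index = cache
    if keys.version != expected_version:   # the guard
        return MISS
    return keys.values[index]              # the operation: one indexed load
```

Rebinding an existing name keeps the cached index valid, while adding a new name changes the version and makes the guard fail.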

Because it is impossible to measure the performance of an instruction without
also measuring unrelated factors, the assessment of the quality of a
specialization will require some judgement.

As a general rule, specialized instructions should be much faster than the
base instruction.

### Implementation of specialized instructions

In general, specialized instructions should be implemented in two parts:
1. A sequence of guards, each of the form
   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`,
   followed by a `record_cache_hit()`.
2. The operation, which should ideally have no branches and
   a minimum number of dependent memory accesses.

In practice, the parts may overlap, as data required for guards
can be re-used in the operation.

If there are branches in the operation, then consider further specialization
to eliminate the branches.
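
The two-part shape can be sketched in Python as follows. This is an analogy rather than the C implementation: the `Deopt` exception stands in for the jump performed by `DEOPT_IF`, the `stats` dictionary stands in for `record_cache_hit()`, and `make_keys`, `load_global_builtin`, `load_global_base` and `execute` are hypothetical helpers.

```python
from types import SimpleNamespace


class Deopt(Exception):
    """Raised when a guard fails; stands in for jumping to the base instruction."""


def deopt_if(guard_failed):
    if guard_failed:
        raise Deopt()


def make_keys(version, mapping):
    # Toy keys object: a version tag plus parallel name/value arrays.
    return SimpleNamespace(version=version, names=list(mapping),
                           values=list(mapping.values()))


def load_global_builtin(cache, globals_keys, builtins_keys, stats):
    # Part 1: guards, then record the hit.
    deopt_if(globals_keys.version != cache.globals_version)
    deopt_if(builtins_keys.version != cache.builtins_version)
    stats["hits"] += 1
    # Part 2: the operation, with no branches and a single indexed load.
    return builtins_keys.values[cache.index]


def load_global_base(name, globals_keys, builtins_keys):
    # Base instruction: search globals, then builtins.
    for keys in (globals_keys, builtins_keys):
        if name in keys.names:
            return keys.values[keys.names.index(name)]
    raise NameError(name)


def execute(name, cache, globals_keys, builtins_keys, stats):
    try:
        return load_global_builtin(cache, globals_keys, builtins_keys, stats)
    except Deopt:
        stats["misses"] += 1
        return load_global_base(name, globals_keys, builtins_keys)
```

Keeping all guards ahead of the operation means a miss is detected before any work is done, so falling back to the base instruction never has to undo anything.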

Python/specialize.c

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@
 #include "opcode.h"
 #include "structmember.h" // struct PyMemberDef, T_OFFSET_EX

+/* For guidance on adding or extending families of instructions see
+ * ./adaptive.md
+ */
+

 /* We layout the quickened data as a bi-directional array:
  * Instructions upwards, cache entries downwards.
